1. Introduction

This document describes the computational analysis of the Chan M, Motakis E, Tan SH et al. Prioritizing Post-Myocardial Infarction Heart Failure Candidates Using Plasma Proteomics and Single-Cell Transcriptomics. Submitted to Circulation, 2020 (under review).

In this study, we employed aptamer-based affinity-capture plasma proteomics to measure 1305 plasma proteins in a New Zealand cohort (CDCS) and a Singapore (IMMACULATE) cohort. We sought to identify circulating proteins associated with an HF incident after an initial MI by performing large-scale proteomic profiling of plasma obtained from patients 30 days after their index MI. The pipeline and methodology have been extensively discussed in the submitted manuscript. Here, we will show in detail the computational steps (R script, tables and plots) that generated the main findings. The reader is free to use our code to replicate the results and to explore other aspects of our data.

A brief description of the datasets used in this study can be found at https://github.com/ArisStefanosSn/HFproteomics.

1.1. Cohorts

The primary cohort was selected from the Coronary Disease Cohort Study (CDCS, ACTRN 12605000431628), which consisted of 2140 patients hospitalized in two tertiary hospitals in New Zealand (NZ), for an acute coronary syndrome from 2002 – 2009, and followed up for a median of 5.1 years (interquartile range 3.7-6.8 years, maximum 9.5 years). Clinical data, blood samples and echocardiographic measurements were obtained at ~ 30 days, 4 months and 12 months after hospital admission. Subsequent clinical events and mortality were obtained from the NZ National Health Information System. The New Zealand Multi-region Ethics Committee approved the study (CTY/02/02/018) and all participants gave written informed consent prior to study participation. For the current study, we selected nested cases of 181 post-MI patients who had readmission for HF (HF group) and another 250 post-MI patients who remained free of HF hospitalization, adverse LV remodelling or death from a cardiovascular cause during follow-up (Control group).

The external cohort was selected from the Improving Outcomes in Myocardial Infarction through Reversal of Cardiac Remodelling (IMMACULATE) registry, which consisted of 859 patients hospitalized for MI at 3 hospitals in Singapore from 2011 to 2014 and followed up for a median of 392 days (interquartile range 361-724 days, maximum 1006 days). Clinical data, blood samples and echocardiographic measurements were obtained within 24-72 hours of admission, 30 days, 6, 12 and 24 months after hospital admission. We selected a cohort of the first 200 consecutively enrolled patients with 30-day plasma samples available and therefore had the longest follow-up period at the time of study completion. Of these 200 patients, 19 were hospitalized for HF. The institutional review board and the ethics committee at Singapore’s National Healthcare Group Domain Specific Review Board (DSRB 2013 /00248 and 2013/00635) approved the study protocol and all patients gave written informed consent prior to participation.

1.2. Clinical endpoints and measured patient characteristics

The primary clinical endpoint was hospitalization for HF, defined as clinically diagnosed acute HF requiring hospitalization for >24 hours or treatment with diuretics if duration of stay was <24 hours. For each patient we considered a large set of clinical measurements in our subsequent analysis:

  • Age: the patient’s age at admission.

  • Gender: (F)emale or (M)ale.

  • Race: Various ethnicities.

  • History.of.diabetes: whether the patient has a history of diabetes, (Y)es or (N)o.

  • History.of.hypertension: whether the patient has a history of hypertension, (Y)es or (N)o.

  • History.of.hyperlipidemia: whether the patient has a history of hyperlipidemia, (Y)es or (N)o.

  • Somalogic_Group_Summary: the patient groups, post-MI Heart Failure (incident_HF), post-MI MI (incident_MI), post-MI event free (Control).

  • Somalogic_Group_Full: Divisins of the above groups into incident_HF patients that showed HF only (HF_First) and incident_HF patients that showed HF and MI together (HF_and_MI_together).

  • Diagnosis_Dx: whether the patient had STEMI or NSTEMI.

  • Smoker: whether the patient in currently a smoker (current), an ex-smoker (ex-smoker) or a non-smoker (non).

  • BMI: the patient’s Body Mass Index.

  • Hx_Renal: History of renal disease, (Y)es or (N)o.

  • Hx_MI: History of Myocardial Infarction, (Y)es or (N)o.

  • Hx_HF: History of Heart Failure, (Y)es or (N)o.

  • Hx_PTCA: History of Percutaneous Transluminal Coronary Angioplasty, (Y)es or (N)o.

  • VesselDisease: Existence of vessel disease, one of None, Single, Double and Triple.

  • ACE1 / ACEI_ARB: Angiotensin Converting Enzyme inhibitors medication, (Y)es or (N)o.

  • Month_EF: Left Ventricular Ejection Fraction measured at 4 months.

The same variables are considered in both the CDCS and the IMMACULATE cohorts.

1.3. Proteomic analysis

Plasma samples (50 \(\mu\)l) collected 30 days after the index event were analysed using a Slow Off-rate Modified Aptamer (SOMAmer)–based capture array called “SOMAscan” (somaLogic, Inc, Boulder, CO, USA). We measured 1,305 human proteins (47% secreted proteins, 28% extracellular domains and 25% intracellular proteins).

1.4. Single-cell RNA-seq

We interrogated our published dataset to integrate single cell non-CM transcriptomic data (scRNA-seq) from mice undergoing coronary ligation (MI) versus sham surgery and single nuclei CM transcriptomic data (snRNA-seq) from mice undergoing transverse aortic constriction (TAC) versus sham surgery. For non-CMs, cells were randomly isolated by flow-assisted cell sorting from digested mouse left ventricles 7 days post-MI/sham surgery (N=3 per group). For CMs, nuclei were isolated from digested mouse left ventricles 8 weeks post-TAC/sham surgery (N=5 per group). The cDNA libraries were prepared by SMART-Seq2 protocol and sequenced using HiSeq 2000 (Illumina). Data were further cross-referenced with publicly-available datasets containing single CM transcriptomes from a mouse TAC model and human DCM patients.

2. The Somalogic data: protein expression profiles, experimental design and protein annotation

We analysed 1,305 plasma proteins measured for 750 patients in the CDCS cohort: 250 Control patients, 124 post-MI patients who had readmission for HF (HF group), 57 post-MI patients who had readmission for HF and MI (HF group), 250 post-MI patients who had readmission for MI (MI group) and 69 with cardiac remodelling asfter the initial MI (remodelled group). In the IMMACULATE cohort we had 330 patients: 107 Healthy Control, 190 post-MI event-free (Control) patients and 33 post-MI HF patients (HF group).

Among the 1,305 proteins are the two forms of proBNP, BNP and NT-proBNP. They are practically the same protein resulting from the cleavage of proBNP into its active and inactive form respectively. For simplicity, in this analysis we will consider and refer to them as two proteins.

# source the necessary R packages and functions
source("ProteomicsData/source.R")

################
# CDCS dataset #
################
desi_CDCS <- read.table("ProteomicsData/Somalogic_Design_CDCS.txt", sep = "\t")
exprs_CDCS <- read.table("ProteomicsData/Somalogic_Expression_CDCS.txt", sep = "\t", header = TRUE)
Annot_CDCS <- read.table("ProteomicsData/Somalogic_Annotation_CDCS.txt", sep = "\t", header = TRUE)

# Number of proteins x samples
dim(exprs_CDCS)
## [1] 1305  750
# Number of HF and Control patients
table(desi_CDCS$Somalogic_Group_Full)
## 
##            Control HF_and_MI_together           HF_First           MI_First 
##                250                 57                124                250 
##         Remodelled 
##                 69
######################
# IMMACULATE dataset #
######################
desi_IMMACULATE <- read.table("ProteomicsData/Somalogic_Design_IMMACULATE.txt", sep = "\t")
desi_IMMACULATE_ACS <- data.frame(as.matrix(desi_IMMACULATE[desi_IMMACULATE$Somalogic_Group_Full == "ACS", ]))
exprs_IMMACULATE <- read.table("ProteomicsData/Somalogic_Expression_IMMACULATE.txt", sep = "\t", header = TRUE)
Annot_IMMACULATE <- read.table("ProteomicsData/Somalogic_Annotation_IMMACULATE.txt", sep = "\t", header = TRUE)

# Number of proteins x samples
dim(exprs_IMMACULATE)
## [1] 1305  330
# Number of Healthy, HF and Control patients
table(desi_IMMACULATE$HF)
## 
## ---   N   Y 
## 107 190  33

Below, we show the frequencies of certain important clinical variables with respect to the disease state at CDCS. Fisher exact test did not reveal any significant differences.

# disease state by ethnic group at CDCS
table(desi_CDCS$Race, desi_CDCS$Somalogic_Group_Full)
##                       
##                        Control HF_and_MI_together HF_First MI_First Remodelled
##   African                    1                  0        0        0          0
##   Asian                      4                  1        3        2          0
##   Chinese                    1                  0        0        0          0
##   European                   7                  2        4        7          5
##   Fijian                     1                  0        0        1          0
##   Indian                     0                  0        0        3          1
##   Not_stated                31                  3        3        8          7
##   NZ_European              137                 40       74      147         41
##   NZ_Maori                   4                  3        3       12          1
##   Other_European            61                  6       34       67         13
##   Other_Pacific_Island       1                  0        0        0          0
##   Pacific                    2                  2        3        3          1
# disease state by gender at CDCS
table(desi_CDCS$Gender, desi_CDCS$Somalogic_Group_Full)
##    
##     Control HF_and_MI_together HF_First MI_First Remodelled
##   F      77                 21       38       77         18
##   M     173                 36       86      173         51
# disease state by history of diabetes at CDCS
table(desi_CDCS$History.of.diabetes, desi_CDCS$Somalogic_Group_Full)
##    
##     Control HF_and_MI_together HF_First MI_First Remodelled
##   N     214                 39       83      202         62
##   Y      36                 18       41       48          7
# disease state by history of hypertension at CDCS (--- means NA)
table(desi_CDCS$History.of.hypertension, desi_CDCS$Somalogic_Group_Full)
##      
##       Control HF_and_MI_together HF_First MI_First Remodelled
##   ---       2                  1        4        1          0
##   N       147                 17       42       90         41
##   Y       101                 39       78      159         28
# disease state by Diagnosis at CDCS
table(desi_CDCS$Diagnosis_Dx, desi_CDCS$Somalogic_Group_Full)
##         
##          Control HF_and_MI_together HF_First MI_First Remodelled
##   NSTEMI     166                 45       92      210         47
##   STEMI       84                 12       32       40         22

In the same way we generated the frequency tables of the IMMACULATE cohort that contains only males from 3 ethnic groups. The data of the IMMACULATE cohort come in 2 batches, A and B, with a limited number of HF cases.

# disease state by ethnic group at IMMACULATE
table(desi_IMMACULATE_ACS$Race,desi_IMMACULATE_ACS$HF)
##          
##             N   Y
##   Chinese 116  19
##   Indian   39   5
##   Malay    35   9
# disease state by gender at IMMACULATE
table(desi_IMMACULATE_ACS$Gender,desi_IMMACULATE_ACS$HF)
##    
##       N   Y
##   M 190  33
# disease state by history of diabetes at IMMACULATE
table(desi_IMMACULATE_ACS$History.of.diabetes,desi_IMMACULATE_ACS$HF)
##    
##       N   Y
##   N 158  23
##   Y  32  10
# disease state by history of hypertension at IMMACULATE
table(desi_IMMACULATE_ACS$History.of.hypertension,desi_IMMACULATE_ACS$HF)
##    
##       N   Y
##   N 112  19
##   Y  78  14
# disease state by Diagnosis at IMMACULATE
table(desi_IMMACULATE_ACS$Diagnosis_Dx,desi_IMMACULATE_ACS$HF)
##         
##            N   Y
##   NSTEMI  80   8
##   STEMI  110  25
# disease state by Batch at IMMACULATE
table(desi_IMMACULATE_ACS$Batch,desi_IMMACULATE_ACS$HF)
##    
##       N   Y
##   A 173  27
##   B  17   6

2.1. Quality control

Quality control was done on the raw Somascan data. The data was first normalized to remove within-run hybridization variation followed by median normalization across all samples and finally calibrated to eliminate inter-plate and inter-run differences. The acceptance criteria for normalization were 0.4 to 2.5 and calibration scale factor for the SOMAmer within ±0.4 of the median.

2.2. Data adjustment

The expression profiles of the samples that passed the quality control were log2-transformed and adjusted for a set of confounding factors. For each protein, we used linear modelling to assess any independent effects of the following confounders: sex, age, BMI, smoking status, clinical history (diabetes, hypertension, hyperlipidemia, vessel and renal disease) and medication status (beta-blockers and ACE) on protein abundance. The variables with a false discovery rate <5% in more than 5% of the proteins were fitted in a multiple regression linear model, to minimize potential confounding effects. Due to the small sample size of the African, Chinese, Fijian, Indian and other pacific ethnicities in CDCS, we grouped the respective patients together.

# adjust the CDCS data
rr <- as.character(desi_CDCS$Race)
rr [which(rr == "African" | 
              rr == "Chinese" | 
              rr == "Fijian" | 
              rr == "Other_Pacific_Island" | 
              rr == "Indian")] <- "OtherRace" 
desi_CDCS$Race <- factor(rr)
adj_CDCS <- adjustData_CDCS(Expr = exprs_CDCS, Design = desi_CDCS)

# save the adjusted data
# write.table(adj_CDCS, "ProteomicsData/Adjusted_Expression_CDCS.txt", sep = "\t")

The IMMACULATE cohort consisted of three enthinicites, i.e. Chinese (majority), Malay and Indian. No grouping was performed prior the data adjustment. Here, the adjustment model also accounted for the batch effects:

# adjust the IMMACULATE data
adj_IMMACULATE <- adjustData_IMMACULATE(Expr = exprs_IMMACULATE, Design = desi_IMMACULATE)

# save the adjusted data
# write.table(adj_IMMACULATE, "ProteomicsData/Adjusted_Expression_IMMACULATE.txt", sep = "\t")

The adjusted and log2-transformed values are stored and the samples of interest are extracted for further analysis:

# the adjusted CDCS data
data_CDCS <- CDCS_summary(folder = "ProteomicsData/")
data_CDCS$Expr[1:5, 1:5]
##                   AK0001   AK0004   AK0019   AK0021   AK0025
## SL019100:STUB1  3.030702 2.717001 3.035798 2.718204 2.834960
## SL007136:CEBPB  3.039809 3.019352 3.164512 3.179514 3.042512
## SL001731:ENO2   4.554199 4.398007 4.370560 4.409599 4.489801
## SL019096:PIAS4  2.871452 2.783088 2.949843 2.971706 2.755966
## SL005173:IL10RA 3.246265 3.229963 3.215740 3.253146 3.082142
dim(data_CDCS$Expr)
## [1] 1305  431
data_CDCS$Design[1:5, ]
##        Subject_ID Age Gender           Race History.of.diabetes
## AK0001     AK0001  60      F    NZ_European                   N
## AK0004     AK0004  52      F    NZ_European                   N
## AK0019     AK0019  76      M          Asian                   N
## AK0021     AK0021  74      M Other_European                   N
## AK0025     AK0025  77      F Other_European                   N
##        History.of.hypertension History.of.hyperlipidemia Somalogic_Group_Full
## AK0001                       N                         N              Control
## AK0004                       N                         N              Control
## AK0019                     ---                         Y             HF_First
## AK0021                       Y                         N              Control
## AK0025                       Y                         N             HF_First
##        Somalogic_Group_Summary Diagnosis_Dx TimePoint SiteId    Smoker
## AK0001                 Control       NSTEMI        T2   CDCS Ex-smoker
## AK0004                 Control       NSTEMI        T2   CDCS       Non
## AK0019             incident_HF       NSTEMI        T2   CDCS       Non
## AK0021                 Control       NSTEMI        T2   CDCS Ex-smoker
## AK0025             incident_HF       NSTEMI        T2   CDCS       Non
##                BMI Hx_Renal Hx_MI Hx_HF Hx_PTCA PriStokeTIA VesselDisease
## AK0001 24.63547508        N     N     N       N           N          None
## AK0004  23.5947692        N     N     N       N           N          None
## AK0019 31.84713376        N     Y     Y       N           N           ---
## AK0021  25.5648038        N     Y     N       N           N        Triple
## AK0025 22.76795005        N     N   ---       N           N        Double
##        Thromb_Rplase Thromb_SK Thromb_TPA Beta_blocker_Discharge ACE1 ACEI_ARB
## AK0001             N         N          N                      N    N      ---
## AK0004             N         N          N                      Y    N      ---
## AK0019             N         N          N                      N    Y      ---
## AK0021             N         N          N                      Y    Y      ---
## AK0025             N         N          N                      Y    N      ---
##          change.EDV   change.ESV    change.EF hsTnT_T2 NTproBNP_T2 hsTnI_T2
## AK0001  31.27413127  4.790419162  11.94029851      ---       38.26      3.6
## AK0004 -20.48076923 -12.36842105 -5.238095238      ---       23.42      2.2
## AK0019          ---          ---  54.65116279      ---         ---      ---
## AK0021  14.50777202  7.418397626  4.795737123      ---        57.3      ---
## AK0025  25.80645161  43.25581395 -8.617886179      ---      380.67    502.7
##        Grace.Score FUDays Death_Cause_Any Death_Cause_Cardiovascular Death_Day
## AK0001         ---   3364               N                          N       ---
## AK0004         ---   3353               N                          N       ---
## AK0019         ---    947               Y                          Y       ---
## AK0021         ---   1530               Y                          N       ---
## AK0025         ---    724               Y                          Y       ---
##        HF First_HF  MI First_MI HStroke First_HStroke IStroke First_IStroke
## AK0001  N     3364 ---      ---       N          3364       N          3364
## AK0004  N     3353 ---      ---       N          3353       N          3353
## AK0019  Y      202 ---      ---       N           947       N           947
## AK0021  N     1530 ---      ---       N          1530       Y           614
## AK0025  Y      162 ---      ---       N           724       N           724
##        Spironolactone STEMI First_Stemi NSTEMI First_NStemi UNSPECMI
## AK0001              N     N        3364      N         3364        N
## AK0004              N     N        3353      N         3353        N
## AK0019              N     N         947      N          947        N
## AK0021              N     N        1530      N         1530        N
## AK0025              N     N         724      N          724        N
##        First_UnSpecMI Baseline_EF Month_EF
## AK0001           3364          67       75
## AK0004           3353          63     59.7
## AK0019            947        17.2     26.6
## AK0021           1530        56.3       59
## AK0025            724        61.5     56.2
# the adjusted IMMACULATE data
data_IMMACULATE <- IMMACULATE_summary(folder = "ProteomicsData/")
data_IMMACULATE$Expr[1:5, 1:5]
##                     P014     P053     P054     P058     P060
## SL019100:STUB1  2.451027 2.688015 3.036004 2.726026 2.629335
## SL007136:CEBPB  2.941218 3.277664 2.959143 3.051050 3.080631
## SL001731:ENO2   3.115845 4.241446 4.237186 4.268835 4.132408
## SL019096:PIAS4  2.360218 2.891926 2.770493 2.979254 2.993664
## SL005173:IL10RA 3.132629 3.295121 3.168715 3.232676 3.310879
dim(data_IMMACULATE$Expr)
## [1] 1305  223
data_IMMACULATE$Design[1:5, ]
##      Subject_ID Age Gender    Race History.of.diabetes History.of.hypertension
## P014       P014  34      M Chinese                   N                       N
## P053       P053  54      M Chinese                   N                       Y
## P054       P054  46      M Chinese                   Y                       Y
## P058       P058  62      M Chinese                   Y                       Y
## P060       P060  46      M Chinese                   N                       N
##      History.of.hyperlipidemia Somalogic_Group_Full Somalogic_Group_Summary
## P014                         Y                  ACS                 Reverse
## P053                         N                  ACS                 Adverse
## P054                         Y                  ACS                 Reverse
## P058                         N                  ACS                 Reverse
## P060                         Y                  ACS                 Adverse
##      Diagnosis_Dx TimePoint SiteId  Smoker         BMI Hx_Renal Hx_MI Hx_HF
## P014        STEMI        T2    NUH Current 26.13920442        N     N     N
## P053        STEMI        T2    NUH Current 26.62070286        N     N     N
## P054        STEMI        T2    NUH Current  24.4646016        N     N     N
## P058        STEMI        T2    NUH     Non  24.5389649        N     N     N
## P060        STEMI        T2    NUH     Non  24.8015873        N     N     N
##      Hx_PTCA PriStokeTIA VesselDisease Thromb_Rplase Thromb_SK Thromb_TPA
## P014       N         ---        Single           ---       ---        ---
## P053       N         ---        Double           ---       ---        ---
## P054       N         ---        Single           ---       ---        ---
## P058       N         ---        Single           ---       ---        ---
## P060       N         ---        Single           ---       ---        ---
##      Beta_blocker_Discharge ACE1 ACEI_ARB change.EDV change.ESV change.EF
## P014                      Y  ---        Y        ---        ---       ---
## P053                      Y  ---        Y        ---        ---       ---
## P054                      Y  ---      ---        ---        ---       ---
## P058                      Y  ---        Y        ---        ---       ---
## P060                      Y  ---        Y        ---        ---       ---
##      hsTnT_T2 NTproBNP_T2 hsTnI_T2 Grace.Score FUDays Death_Cause_Any
## P014      ---         ---      ---         ---    ---             ---
## P053      ---         ---      ---         ---    ---             ---
## P054      ---         ---      ---         ---    ---             ---
## P058      ---         ---      ---         ---    ---             ---
## P060      ---         ---      ---         ---    ---             ---
##      Death_Cause_Cardiovascular Death_Day HF First_HF  MI First_MI HStroke
## P014                        ---       ---  Y      --- ---      ---     ---
## P053                        ---       ---  N      --- ---      ---     ---
## P054                        ---       ---  Y      --- ---      ---     ---
## P058                        ---       ---  N      --- ---      ---     ---
## P060                        ---       ---  N      --- ---      ---     ---
##      First_HStroke IStroke First_IStroke Spironolactone STEMI First_Stemi
## P014           ---     ---           ---            ---   ---         ---
## P053           ---     ---           ---            ---   ---         ---
## P054           ---     ---           ---            ---   ---         ---
## P058           ---     ---           ---            ---   ---         ---
## P060           ---     ---           ---            ---   ---         ---
##      NSTEMI First_NStemi UNSPECMI First_UnSpecMI Batch Baseline_EF Month_EF
## P014    ---          ---      ---            ---     B        47.2     47.1
## P053    ---          ---      ---            ---     B          42     52.9
## P054    ---          ---      ---            ---     B        36.4     48.1
## P058    ---          ---      ---            ---     B        43.1     44.1
## P060    ---          ---      ---            ---     B        47.6     45.1

We use t-Distributed Stochastic neighbor Embedding (tSNE) to contruct a low dimensional embedding of the high dimensional IMMACULATE data and visualise the successfull batch effect minimization. The tSNE analysis used all proteins that passed the quality control.

# tSNE of IMMACULATE (with Batch information)
TSNEplot(Expr = data_IMMACULATE$Expr,
         Design = data_IMMACULATE$Design,
         Annotation = data_IMMACULATE$Annotation,
         cohort = "IMMACULATE",
         diseaseColumn = 53)

3. Somalogic data analysis (Proteomics)

This section outlines the analytical steps we followed for the analysis of the Somalogic data. It includes the individual pipelines for each cohort and their integration with the Left Ventricular Ejection Fraction (LVEF) measurements.

3.1. Test for variances

Before proceeding to the classical differential expression analysis that identifies proteins whose expression levels differ across HF and Control, we run Levene’s test to identify potential proteins whose variances differ significantly across those states. Levene’s test does not make any assumptions oon the data distribution.

#############################
# test for variance in CDCS #
#############################
het_CDCS <- LeveneTest(Expr = data_CDCS$Expr,
                       Design = data_CDCS$Design,
                       Annotation = data_CDCS$Annotation,
                       cohort = "CDCS",
                       diseaseColumn = 9)
het_CDCS[1:5,]
##            Protein              LeveneP            adj.P.Val
## 130  SL001777:CST3 1.44943373921585e-09 1.63496125783548e-06
## 58   SL008631:DSC2 5.71533636980021e-08 3.18071085537748e-05
## 667    SL000586:F2 8.45933738132309e-08 3.18071085537748e-05
## 53  SL011888:SMOC1 2.71041139465109e-07 7.64336013291607e-05
## 31   SL004557:CD59 3.89726779371348e-07 8.79223614261761e-05
# store the results
# write.table(het_CDCS, "ProteomicsData/VarTest_CDCS.txt", sep = "\t")

###################################
# test for variance in IMMACULATE #
###################################
het_IMMACULATE <- LeveneTest(Expr = data_IMMACULATE$Expr,
                       Design = data_IMMACULATE$Design,
                       Annotation = data_IMMACULATE$Annotation,
                       cohort = "IMMACULATE",
                       diseaseColumn = 38)
het_IMMACULATE[1:5,]
##            Protein              LeveneP         adj.P.Val
## 315  SL005205:NCR3 5.11693747145376e-05 0.055584265274954
## 817   SL000087:IL6   9.405121027911e-05 0.055584265274954
## 731 SL011535:RBM39 0.000375594725688858  0.14798432192141
## 697    SL000586:F2 0.000594794949622844  0.17576190761355
## 95   SL000053:PLAT  0.00157844533766554  0.23648242138832
# store the results
# write.table(het_IMMACULATE, "ProteomicsData/VarTest_IMMACULATE.txt", sep = "\t")

In CDCS we identified 50 proteins with significant adjusted P-values at FDR = 1% while in the IMMACULATE there were no such significant differences, potentially due to the small sample size of HF group.

# significant levene test in CDCS
table(as.numeric(as.character(het_CDCS$adj.P.Val)) <= 0.01)
## 
## FALSE  TRUE 
##  1078    50
# significant levene test in IMMACULATE
table(as.numeric(as.character(het_IMMACULATE$adj.P.Val)) <= 0.01)
## 
## FALSE 
##  1182

3.2. Differential expression analysis

The differential expression analysis was performed with the limma model as follows:

# protein differential expression in CDCS
DEs_CDCS <- DE_forHF(Expr = data_CDCS$Expr, 
                     Design = data_CDCS$Design,
                     Annotation = data_CDCS$Annotation,
                     Cohort = "CDCS")
DEs_CDCS[1:5, ]
##                  SomaId                TargetFullName             Target
## SL002785:NPPB  SL002785            N-terminal_pro-BNP N-terminal_pro-BNP
## SL001777:CST3  SL001777                    Cystatin-C         Cystatin_C
## SL006119:TFF3  SL006119              Trefoil_factor_3               TFF3
## SL009324:FSTL3 SL009324 Follistatin-related_protein_3              FSTL3
## SL008099:CAPG  SL008099    Macrophage-capping_protein               CAPG
##                UniProt EntrezGeneID EntrezGeneSymbol Flag     logFC  AveExpr
## SL002785:NPPB   P16860         4879             NPPB PASS 0.8391699 3.571936
## SL001777:CST3   P01034         1471             CST3 PASS 0.2552607 2.947754
## SL006119:TFF3   Q07654         7033             TFF3 PASS 0.3352670 3.571181
## SL009324:FSTL3  O95633        10272            FSTL3 PASS 0.2234513 3.348035
## SL008099:CAPG   P40121          822             CAPG PASS 0.3413176 2.387919
##                       t      P.Value    adj.P.Val Comparison
## SL002785:NPPB  8.545723 2.207961e-16 2.881389e-13 HF-Control
## SL001777:CST3  6.939563 1.443106e-11 9.416268e-09 HF-Control
## SL006119:TFF3  6.757341 4.552645e-11 1.980400e-08 HF-Control
## SL009324:FSTL3 6.190932 1.390123e-09 3.754770e-07 HF-Control
## SL008099:CAPG  6.171111 1.560127e-09 3.754770e-07 HF-Control
# store the results
# write.table(DEs_CDCS, "ProteomicsData/Differential_Expression_CDCS.txt", sep = "\t")

We considered as differentially expressed the proteins with FDR < 5% (column adj.P.Val) that passed the quality control. The differentially expressed proteins are shown in a volcano plot:

# volcano plot of differential expression estimates in CDCS
VolcanoPlot(de = DEs_CDCS, cohort = "CDCS")
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

We found that 38 out of the 212 differentially expressed proteins have different variances across the HF and Control groups which may also contribute to the differences in the respective expression levels.

# differentially expressed proteins in CDCS
DEproteins_in_cdcs <- as.character(rownames(DEs_CDCS[DEs_CDCS$adj.P.Val <= 0.05 & DEs_CDCS$Flag == "PASS", ]))
length(DEproteins_in_cdcs)
## [1] 212
# proteins with differential variability in CDCS 
DVproteins_in_cdcs <- as.character(het_CDCS[as.numeric(as.character(het_CDCS$adj.P.Val)) <= 0.01, 1])
length(DVproteins_in_cdcs)
## [1] 50
# differentially expressed proteins with different variances across the patient groups
length(intersect(DEproteins_in_cdcs, DVproteins_in_cdcs))
## [1] 38

The IMMACULATE cohort was underpowered since it included only 33 HF cases which led to a low detection power of differentially expressed proteins (only 12 proteins passed the above criteria). Below we present the best 15 hits at FDR = 10% which include the famous NPPB (proBNP) and NPPA (atrial natriuretic factor) proteins. The most famous marker NT-proBNP is the next (16th) ranked with FDR = 0.106.

# protein differential expression in IMMACULATE
DEs_IMMACULATE <- DE_forHF(Expr = data_IMMACULATE$Expr, 
                     Design = data_IMMACULATE$Design,
                     Annotation = data_IMMACULATE$Annotation,
                     Cohort = "IMMACULATE")
DEs_IMMACULATE[1:15, ]
##                   SomaId
## SL007033:LTBP4  SL007033
## SL007696:CD93   SL007696
## SL003320:FIGF   SL003320
## SL005115:SPON1  SL005115
## SL007680:ROBO2  SL007680
## SL008416:MRC2   SL008416
## SL001996:ANGPT2 SL001996
## SL009324:FSTL3  SL009324
## SL004258:ADIPOQ SL004258
## SL000053:PLAT   SL000053
## SL000087:IL6    SL000087
## SL000306:NPPB   SL000306
## SL000124:MMP2   SL000124
## SL002505:NPPA   SL002505
## SL000076:CDKN1B SL000076
##                                                           TargetFullName
## SL007033:LTBP4  Latent-transforming_growth_factor_beta-binding_protein_4
## SL007696:CD93                          Complement_component_C1q_receptor
## SL003320:FIGF                       Vascular_endothelial_growth_factor_D
## SL005115:SPON1                                                 Spondin-1
## SL007680:ROBO2                                      Roundabout_homolog_2
## SL008416:MRC2                                  C-type_mannose_receptor_2
## SL001996:ANGPT2                                           Angiopoietin-2
## SL009324:FSTL3                             Follistatin-related_protein_3
## SL004258:ADIPOQ                                              Adiponectin
## SL000053:PLAT                          Tissue-type_plasminogen_activator
## SL000087:IL6                                               Interleukin-6
## SL000306:NPPB                               Brain_natriuretic_peptide_32
## SL000124:MMP2                                 72_kDa_type_IV_collagenase
## SL002505:NPPA                                  Atrial_natriuretic_factor
## SL000076:CDKN1B                     Cyclin-dependent_kinase_inhibitor_1B
##                         Target UniProt EntrezGeneID EntrezGeneSymbol FlagA
## SL007033:LTBP4           LTBP4  Q8N2S1         8425            LTBP4  PASS
## SL007696:CD93            C1QR1  Q9NPY3        22918             CD93  PASS
## SL003320:FIGF           VEGF-D  O43915         2277             FIGF  PASS
## SL005115:SPON1       Spondin-1  Q9HCB6        10418            SPON1  PASS
## SL007680:ROBO2           ROBO2  Q9HCK4         6092            ROBO2  PASS
## SL008416:MRC2             MRC2  Q9UBG0         9902             MRC2  PASS
## SL001996:ANGPT2 Angiopoietin-2  O15123          285           ANGPT2  PASS
## SL009324:FSTL3           FSTL3  O95633        10272            FSTL3  PASS
## SL004258:ADIPOQ    Adiponectin  Q15848         9370           ADIPOQ  PASS
## SL000053:PLAT              tPA  P00750         5327             PLAT  PASS
## SL000087:IL6              IL-6  P05231         3569              IL6  PASS
## SL000306:NPPB           BNP-32  P16860         4879             NPPB  PASS
## SL000124:MMP2            MMP-2  P08253         4313             MMP2  PASS
## SL002505:NPPA              ANP  P01160         4878             NPPA  PASS
## SL000076:CDKN1B        p27Kip1  P46527         1027           CDKN1B  PASS
##                         Protein      logFC  AveExpr         t      P.Value
## SL007033:LTBP4   SL007033:LTBP4  0.2451720 3.007896  5.977009 8.858637e-09
## SL007696:CD93     SL007696:CD93  0.2532191 3.988323  5.072210 8.235112e-07
## SL003320:FIGF     SL003320:FIGF  0.3115731 2.524268  5.021685 1.044086e-06
## SL005115:SPON1   SL005115:SPON1  0.2287463 3.259998  4.980009 1.268187e-06
## SL007680:ROBO2   SL007680:ROBO2  0.1932418 3.504850  4.812317 2.739857e-06
## SL008416:MRC2     SL008416:MRC2  0.1983016 3.551319  4.437876 1.424627e-05
## SL001996:ANGPT2 SL001996:ANGPT2  0.2977880 2.116766  4.393125 1.723218e-05
## SL009324:FSTL3   SL009324:FSTL3  0.1766642 3.485938  4.054383 6.935081e-05
## SL004258:ADIPOQ SL004258:ADIPOQ  0.3155099 3.274410  3.971442 9.625537e-05
## SL000053:PLAT     SL000053:PLAT  0.3399825 2.397425  3.834363 1.635838e-04
## SL000087:IL6       SL000087:IL6  0.3450569 2.361511  3.693566 2.778209e-04
## SL000306:NPPB     SL000306:NPPB  0.5084806 2.602311  3.601073 3.901509e-04
## SL000124:MMP2     SL000124:MMP2  0.1421153 3.852925  3.526748 5.100632e-04
## SL002505:NPPA     SL002505:NPPA  0.4497471 2.833070  3.449183 6.715464e-04
## SL000076:CDKN1B SL000076:CDKN1B -0.5096623 2.879893 -3.311893 1.079999e-03
##                    adj.P.Val Comparison
## SL007033:LTBP4  1.156052e-05 HF-Control
## SL007696:CD93   4.137462e-04 HF-Control
## SL003320:FIGF   4.137462e-04 HF-Control
## SL005115:SPON1  4.137462e-04 HF-Control
## SL007680:ROBO2  7.151027e-04 HF-Control
## SL008416:MRC2   3.098563e-03 HF-Control
## SL001996:ANGPT2 3.212570e-03 HF-Control
## SL009324:FSTL3  1.131285e-02 HF-Control
## SL004258:ADIPOQ 1.395703e-02 HF-Control
## SL000053:PLAT   2.134768e-02 HF-Control
## SL000087:IL6    3.295966e-02 HF-Control
## SL000306:NPPB   4.242891e-02 HF-Control
## SL000124:MMP2   5.120249e-02 HF-Control
## SL002505:NPPA   6.259772e-02 HF-Control
## SL000076:CDKN1B 9.395991e-02 HF-Control
# store the results
# write.table(DEs_IMMACULATE, "ProteomicsData/Differential_Expression_IMMACULATE.txt", sep = "\t")

# number of significant proteins in IMMACULATE (FDR = 5%)
DEproteins_in_immaculate5 <- as.character(rownames(DEs_IMMACULATE[DEs_IMMACULATE$adj.P.Val <= 0.05 & DEs_IMMACULATE$FlagA == "PASS", ]))
length(DEproteins_in_immaculate5)
## [1] 12
# number of significant proteins in IMMACULATE (FDR = 10%)
DEproteins_in_immaculate10 <- as.character(rownames(DEs_IMMACULATE[DEs_IMMACULATE$adj.P.Val <= 0.1 & DEs_IMMACULATE$FlagA == "PASS", ]))
length(DEproteins_in_immaculate10)
## [1] 15
# the rank of NT-proBNP
grep("N-terminal_pro-BNP", DEs_IMMACULATE$TargetFullName)
## [1] 16
DEs_IMMACULATE[grep("N-terminal_pro-BNP", DEs_IMMACULATE$TargetFullName), ]
##                 SomaId     TargetFullName             Target UniProt
## SL002785:NPPB SL002785 N-terminal_pro-BNP N-terminal_pro-BNP  P16860
##               EntrezGeneID EntrezGeneSymbol FlagA       Protein     logFC
## SL002785:NPPB         4879             NPPB  PASS SL002785:NPPB 0.7968104
##                AveExpr        t     P.Value adj.P.Val Comparison
## SL002785:NPPB 4.257362 3.255048 0.001309033  0.106768 HF-Control

In terms of differential expression at FDR < 5%, we find 8 reproducible proteins across the two cohorts:

# common proteins by FDR
common_by_fdr <- intersect(DEproteins_in_immaculate5, DEproteins_in_cdcs)

# common by FDR and directionality
common_CDCS <- DEs_CDCS[match(common_by_fdr, rownames(DEs_CDCS)), ]
common_IMMACULATE <- DEs_IMMACULATE[match(common_by_fdr, rownames(DEs_IMMACULATE)), ]
ww <- sign(common_CDCS[,8]) == sign(common_IMMACULATE[,9])
table(ww)
## ww
## FALSE  TRUE 
##     1     8
common_CDCS <- common_CDCS[ww, ]
common_IMMACULATE <- common_IMMACULATE[ww, ]

# common proteins
as.character(rownames(common_CDCS))
## [1] "SL007033:LTBP4"  "SL007696:CD93"   "SL003320:FIGF"   "SL005115:SPON1" 
## [5] "SL001996:ANGPT2" "SL009324:FSTL3"  "SL004258:ADIPOQ" "SL000306:NPPB"

We used the meta-analysis metaRNASeq algorithm to combine the information of the CDCS and IMMACULATE experiments and provide an indication as to which proteins tend to be significant in both cohorts (and thus cross-cohort validated). We used strict criteria to identify them, i.e. they should be (1) differentially expressed in the CDCS (FDR < 5% that pass the quality control), (2) show evidence of differential expression in the IMMACULATE (P-value < 5% that pass the quality control), (3) differentially expressed in the meta-analysis (meta-analysis FDR < 5%) and (4) the logFCs in the CDCS and the IMMACULATE have the same direction (both up-regulated or both down-regulated in HF).

# meta-analysis
meta <- metaCohorts(DE_CDCS = DEs_CDCS,
                    DE_IMMACULATE = DEs_IMMACULATE)
head(meta)
##                 isDE_byFDR5perc_CDCS isDE_byFDR5perc_IMM isDE_byP5perc_IMM
## SL002785:NPPB                   TRUE               FALSE              TRUE
## SL006119:TFF3                   TRUE               FALSE              TRUE
## SL009324:FSTL3                  TRUE                TRUE              TRUE
## SL006527:EFEMP1                 TRUE               FALSE              TRUE
## SL003869:GDF15                  TRUE               FALSE              TRUE
## SL003320:FIGF                   TRUE                TRUE              TRUE
##                 signIndex Fisher_meta_FDR Validated
## SL002785:NPPB           1    0.000000e+00      TRUE
## SL006119:TFF3           1    1.032246e-08      TRUE
## SL009324:FSTL3          1    1.099741e-09      TRUE
## SL006527:EFEMP1         1    4.413441e-08      TRUE
## SL003869:GDF15          1    2.655746e-07      TRUE
## SL003320:FIGF           1    2.503278e-10      TRUE
# store the results
# write.table(meta, "ProteomicsData/Meta_IMMACULATE.txt", sep = "\t")

# number of validated proteins (TRUE)
table(meta$Validated)
## 
## FALSE  TRUE 
##  1069    36

This procedure found 36 cross-cohort validated proteins of which NPPB (NT-proBNP) was at the top of the list (meta-analysis FDR < 1e-16). Several other famous candidates also appeared such as TNNT2, NPPB (pro-BNP), TNNI3 etc.

3.3. Dimensionality reduction of the IMMACULATE protein expression profiles

To check the importance of the significant proteins derived from the IMMACULATE cohort, we examine the percentage of variance explained by them in a PCA plot and visualise how well they separate the patient groups of interest with tSNE. First we check the 15 proteins that were found to be differentially expressed at FDR = 10%. The results show a clear improvement in the separation of HF and Control groups compared to the above (all proteins) tSNE representation.

# differentially expressed proteins at FDR = 10%
proteins15 <- rownames( DEs_IMMACULATE[ DEs_IMMACULATE$adj.P.Val <= 0.1,])
  
# significant proteins at FDR = 10%
k15 <- redDim_IMMACULATE(Expr = data_IMMACULATE$Expr,
                        Design = data_IMMACULATE$Design,
                        Annotation = data_IMMACULATE$Annotation,
                        proteins = proteins15,
                        diseaseColumn = 38,
                        legend = "significant proteins at FDR = 10%")

## NULL

Next, we study the 36 proteins that have been validated in both cohorts by meta-analysis. Note that 5 out of the 15 proteins with FDR < 10% in the IMMACULATE cohort were not validated in the meta-analysis (they were not highly significant in the CDCS), 10 proteins were common by both methods and 26 proteins were found only in the meta-analysis (among which is the famous NT-proBNP marker). As before, the separation of HF and Control groups is clear.

# all validated proteins of meta-analysis
proteins36 <- rownames(meta)[meta$Validated]

# proteins in top 15 that were not cross-cohort validated
setdiff(proteins15, proteins36)
## [1] "SL007680:ROBO2"  "SL008416:MRC2"   "SL000053:PLAT"   "SL000124:MMP2"  
## [5] "SL000076:CDKN1B"
# common proteins in top 15 cross-cohort validation
intersect(proteins15, proteins36)
##  [1] "SL007033:LTBP4"  "SL007696:CD93"   "SL003320:FIGF"   "SL005115:SPON1" 
##  [5] "SL001996:ANGPT2" "SL009324:FSTL3"  "SL004258:ADIPOQ" "SL000087:IL6"   
##  [9] "SL000306:NPPB"   "SL002505:NPPA"
# cross-cohort validated proteins that were not among the top 15
setdiff(proteins36, proteins15)
##  [1] "SL002785:NPPB"     "SL006119:TFF3"     "SL006527:EFEMP1"  
##  [4] "SL003869:GDF15"    "SL000052:TNNT2"    "SL007056:BMP10"   
##  [7] "SL006610:ADAMTS13" "SL007206:THBS2"    "SL004646:LAYN"    
## [10] "SL009400:CHRDL1"   "SL011888:SMOC1"    "SL004060:ECE1"    
## [13] "SL003327:CFD"      "SL008382:CST5"     "SL003324:F10"     
## [16] "SL003770:SFRP1"    "SL005230:UNC5C"    "SL005789:STC1"    
## [19] "SL000592:TIMP2"    "SL001761:TNNI3"    "SL006924:AGRP"    
## [22] "SL005209:NOTCH3"   "SL010381:DKKL1"    "SL002823:SELL"    
## [25] "SL006230:LUM"      "SL000462:IGFBP1"
# cross-cohort validated proteins
k36 <- redDim_IMMACULATE(Expr = data_IMMACULATE$Expr,
                         Design = data_IMMACULATE$Design,
                         Annotation = data_IMMACULATE$Annotation,
                         proteins = proteins36,
                         diseaseColumn = 38,
                         legend = "36 cross-cohort validated proteins")

## NULL

Finally, we study only the separation offered by the 26 cross-cohort validated proteins that are not detected by IMMACULATE’s FDR cutoff. Again, the separation of the two groups is visually clear indicating the importance of these candidates.

# proteins uniquely found in meta-analysis
proteins26 <- setdiff(proteins36, proteins15)

# cross-cohort validated proteins with FDR > 10%
k26 <- redDim_IMMACULATE(Expr = data_IMMACULATE$Expr,
                         Design = data_IMMACULATE$Design,
                         Annotation = data_IMMACULATE$Annotation,
                         proteins = proteins26, 
                         diseaseColumn = 38,
                         legend = "26 cross-cohort validated proteins with FDR > 10%")

## NULL

3.4. The significance of the cross-cohort validated proteins in separating the IMMACULATE patient groups

To quantify the significance of the 36 proteins extracted from the meta-analysis, we developed a simple methodology that compares the HF and Control clusters generated from those proteins with the clusters formed by a random selection of any 36 proteins. The random selection is taken from the pool of the CDCS significant proteins in order to build a strict testing procedure that takes into account meaningful candidates. The algorithm we developed works as follows:

  • Step 1: Estimate the test statistic \(t = (M_{Control} - M_{HF}) / \sqrt{(Var_{Control} + Var_{HF})\) where \(M_{Control}\) is the 2D mean of the Control group samples estimated from their PCA coordinates, \(M_{HF}\) is the 2D mean of the HF group samples estimated from their PCA coordinates, \(Var_{Control}\) is the variance of the Control group estimated from the within cluster sum of squares using Euclidean distances and \(Var_{HF}\) is the variance of the HF group estimated from the within cluster sum of squares using Euclidean distances. This is similar to a t-statistic quantifying the mean difference of two samples with unequal variances and sample sizes.

  • Step 2: At the first iteration b = 1, select 36 random proteins from the pool of the 212 significant CDCS proteins, estimate a PCA and evaluate the resampling statistic \(t^* = (M^*_{Control} - M^*_{HF}) / \sqrt{(Var^*_{Control} + Var^*_{HF})\) where \(M^*_{Control}\), \(M^*_{HF}\), \(Var^*_{Control}\) and \(Var^*_{HF}\) are defined as above.

  • Step 3: Iterate for b = 2, …, B = 10,000 and compare the resampling statistics \(t^*\) against \(t\) to obtain a resampling p-value: \(P = #(t^* > t) / B\) that evaluates the significance of the group separation offered by the 36 identified proteins.

# run the algorithm 10,000 times
background.proteins <- DEproteins_in_cdcs
clStat <- clusterSignificance(pcadata = k36,
                              randomSet = 36,
                              proteins = background.proteins,
                              B = 10000)

# store the results of the t*-statistics
# write(clStat$testB, "ProteomicsData/tstar.txt", ncolumns = 1) 

This procedure indicated that the 36 selected proteins can significantly separate the HF and Control groups with P < 1e-4.

# read the stored t-statistics for simplicity
clStat_all.t <- scan("ProteomicsData/tstar.txt")

# the t-statistic of the 36 set is the first element
clStat_t <- clStat_all.t[1]
clStat_t
## [1] 0.7527349
# the t*-statistics of the resampling are the rest of the elements
clStat_tstar <- clStat_all.t[2:length(clStat_all.t)]

# see the histogram of the 10,000 t*-statistics
hist(clStat_tstar, xlim = c(0, 1), xlab = "t-statistics", 
    main = "Cluster comparison statistic of 10,000 random protein sets",
    sub = "The vertical line is the t-stat of the 36 proteins")
abline(v = clStat_t)

#p1<-ggplot(hey,aes(x=clStat_tstar)) + geom_histogram(colour="gray") +   
#      labs(y="Frequency",x="t-star") + 
#        geom_vline(xintercept=clStat_t,colour="red",linetype="dashed") +
#        lims(x=c(0,0.8))

3.5. The significance of the cross-cohort validated proteins in predicting the patient groups across cohorts

The above methodology confirmed that the 36 protein separation of the groups in the IMMACULATE is not random. We are also interested how well this validated set performs in predicting the HF and Control groups in both cohorts. To do this, we developed the following machine learning methodology using supervised Random Forests:

  • Step 1: Apply the supervised random forests algorithm on the 36 cross-cohort validated proteins in the IMMACULATE test cohort and predict the HF and Control groups in the CDCS cohort. Calculate the confusion matrix, \(k\), between the estimated and the known disease groups.

  • Step 2: Sort the CDCS and the IMMACULATE DE results by FDR, pick those with FDR < 5% in both sets and repeat Step 1 to find the new confusion matrix \(top_1\).

  • Step 3: Iterate with a small step \(\epsilon = 0.01\) to pick the proteins with FDR < 5% + \(\epsilon\) and find new confusion matrices \(top_2\), …, \(top_B\). Stop the process at the cutoff FDR = 0.5.

  • Step 4: Compare the estimation error of \(k\) against the estimation errors of the confusion matrices of the top IMMACULATE proteins.

# random forest application with 2,000 trees
rf<-rfGroupSeparation(Expr_CDCS = data_CDCS$Expr,
                      Expr_IMMACULATE = data_IMMACULATE$Expr,
                      Design_CDCS = data_CDCS$Design,
                      Design_IMMACULATE = data_IMMACULATE$Design,
                      DEs_CDCS = DEs_CDCS,
                      DEs_IMMACULATE = DEs_IMMACULATE,
                      proteins = proteins36,
                      ntree = 2000,
                      starting.fdr = 0.05,
                      ending.fdr = 0.5)

# store the results
# write.table(rf, "ProteomicsData/RF.txt", sep = "\t")

The results shows that our 36 proteins is the top predictive set exhibiting minimal estimation error in both individual groups and in total (mean). Of particular importance is that, as the figures show, these 36 proteins form a relatively large set whose prediction accuracy is directly comparable to the accuracies of much smaller protein sets in Control. The moderate to low error magnitudes could be attributed to the data noise, the unbalanced design, the ethnicity differences across the two cohorts and, importantly, to the fact that the Control and HF patients are essentially both disease groups (there is no classical healthy control).

# read the stored estimation errors for simplicity
rf <- read.table("ProteomicsData/RF.txt", sep = "\t", header = T)
rf
##    N.of.proteins         N         Y
## 1              8 0.3675676 0.2622951
## 2              9 0.3691173 0.2565789
## 3             10 0.3676471 0.2368421
## 4             11 0.3671551 0.2301065
## 5             13 0.3701405 0.2297313
## 6             14 0.3713528 0.2407407
## 7             16 0.3713528 0.2407407
## 8             17 0.3728041 0.2288185
## 9             18 0.3733681 0.2083333
## 10            21 0.3762892 0.2111936
## 11            22 0.3775773 0.1976744
## 12            26 0.3788660 0.2093023
## 13            27 0.3791774 0.2023810
## 14            28 0.3810701 0.2001251
## 15            32 0.3801020 0.1794872
## 16            33 0.3826531 0.2051282
## 17            36 0.3704663 0.1555556
## 18            37 0.3762887 0.1860465
## 19            40 0.3792878 0.1911555
## 20            41 0.3794872 0.1951220
# estimate the mean errors of both groups
me <- apply(rf[, 2:3], 1, mean)

# plot the results
par(mfrow=c(2, 2))
plot(rf[rf[,1] != 36, 1:2], 
     pch = 20, cex = 1.5, ylim = range(rf[, 2]),
     xlab = "Number of proteins",
     ylab = "Estimation error (Control)",
     sub = "The cross indicates the estimate of the 36 proteins")
lines(lowess(rf[rf[, 1] != 36, 1], rf[rf[, 1] != 36, 2]))
points(rf[rf[, 1] == 36, 1:2], col = 2, pch = 3)

plot(rf[rf[,1] != 36, c(1, 3)], 
     pch = 20, cex = 1.5, ylim = range(rf[, 3]),
     xlab = "Number of proteins",
     ylab = "Estimation error (HF)",
     sub = "The cross indicates the estimate of the 36 proteins")
lines(lowess(rf[rf[, 1] != 36, 1], rf[rf[, 1] != 36, 3]))
points(rf[rf[, 1] == 36, c(1, 3)], col = 2, pch = 3)


plot(rf[rf[,1] != 36, 1], me[rf[,1] != 36], 
     pch = 20, cex = 1.5, ylim = range(me),
     xlab = "Number of proteins",
     ylab = "Estimation error (Mean)",
     sub = "The cross indicates the estimate of the 36 proteins")
lines(lowess(rf[rf[, 1] != 36, 1], me[rf[, 1] != 36]))
points(rf[rf[, 1] == 36, 1], me[rf[, 1] == 36], col = 2, pch = 3)

Our methodology keeps an important amount of information that would have been lost if we used only the FDR for validating the candidates of both cohorts. This is clearly highlighted by looking at the differential expression statistics of the 36 proteins in the IMMACULATE (by default all of them are significant in CDCS). This set cannot be reproduced by a method that is based on FDR-sorteds proteins:

# differential expressin statistics of the 36 proteins in the IMMACULATE
DEs_IMMACULATE[match(proteins36, rownames(DEs_IMMACULATE)),]
##                     SomaId
## SL002785:NPPB     SL002785
## SL006119:TFF3     SL006119
## SL009324:FSTL3    SL009324
## SL006527:EFEMP1   SL006527
## SL003869:GDF15    SL003869
## SL003320:FIGF     SL003320
## SL000052:TNNT2    SL000052
## SL001996:ANGPT2   SL001996
## SL007056:BMP10    SL007056
## SL000306:NPPB     SL000306
## SL006610:ADAMTS13 SL006610
## SL007206:THBS2    SL007206
## SL004646:LAYN     SL004646
## SL004258:ADIPOQ   SL004258
## SL005115:SPON1    SL005115
## SL009400:CHRDL1   SL009400
## SL011888:SMOC1    SL011888
## SL004060:ECE1     SL004060
## SL003327:CFD      SL003327
## SL008382:CST5     SL008382
## SL003324:F10      SL003324
## SL003770:SFRP1    SL003770
## SL005230:UNC5C    SL005230
## SL005789:STC1     SL005789
## SL000592:TIMP2    SL000592
## SL007696:CD93     SL007696
## SL001761:TNNI3    SL001761
## SL006924:AGRP     SL006924
## SL005209:NOTCH3   SL005209
## SL007033:LTBP4    SL007033
## SL010381:DKKL1    SL010381
## SL002823:SELL     SL002823
## SL006230:LUM      SL006230
## SL000462:IGFBP1   SL000462
## SL002505:NPPA     SL002505
## SL000087:IL6      SL000087
##                                                                      TargetFullName
## SL002785:NPPB                                                    N-terminal_pro-BNP
## SL006119:TFF3                                                      Trefoil_factor_3
## SL009324:FSTL3                                        Follistatin-related_protein_3
## SL006527:EFEMP1          EGF-containing_fibulin-like_extracellular_matrix_protein_1
## SL003869:GDF15                                     Growth/differentiation_factor_15
## SL003320:FIGF                                  Vascular_endothelial_growth_factor_D
## SL000052:TNNT2                                           Troponin_T,_cardiac_muscle
## SL001996:ANGPT2                                                      Angiopoietin-2
## SL007056:BMP10                                        Bone_morphogenetic_protein_10
## SL000306:NPPB                                          Brain_natriuretic_peptide_32
## SL006610:ADAMTS13 A_disintegrin_and_metalloproteinase_with_thrombospondin_motifs_13
## SL007206:THBS2                                                     Thrombospondin-2
## SL004646:LAYN                                                               Layilin
## SL004258:ADIPOQ                                                         Adiponectin
## SL005115:SPON1                                                            Spondin-1
## SL009400:CHRDL1                                              Chordin-like_protein_1
## SL011888:SMOC1                      SPARC-related_modular_calcium-binding_protein_1
## SL004060:ECE1                                        Endothelin-converting_enzyme_1
## SL003327:CFD                                                    Complement_factor_D
## SL008382:CST5                                                            Cystatin-D
## SL003324:F10                                                  Coagulation_factor_Xa
## SL003770:SFRP1                                  Secreted_frizzled-related_protein_1
## SL005230:UNC5C                                                Netrin_receptor_UNC5C
## SL005789:STC1                                                       Stanniocalcin-1
## SL000592:TIMP2                                        Metalloproteinase_inhibitor_2
## SL007696:CD93                                     Complement_component_C1q_receptor
## SL001761:TNNI3                                           Troponin_I,_cardiac_muscle
## SL006924:AGRP                                                Agouti-related_protein
## SL005209:NOTCH3                            Neurogenic_locus_notch_homolog_protein_3
## SL007033:LTBP4             Latent-transforming_growth_factor_beta-binding_protein_4
## SL010381:DKKL1                                              Dickkopf-like_protein_1
## SL002823:SELL                                                            L-Selectin
## SL006230:LUM                                                                Lumican
## SL000462:IGFBP1                        Insulin-like_growth_factor-binding_protein_1
## SL002505:NPPA                                             Atrial_natriuretic_factor
## SL000087:IL6                                                          Interleukin-6
##                                           Target UniProt EntrezGeneID
## SL002785:NPPB                 N-terminal_pro-BNP  P16860         4879
## SL006119:TFF3                               TFF3  Q07654         7033
## SL009324:FSTL3                             FSTL3  O95633        10272
## SL006527:EFEMP1                            FBLN3  Q12805         2202
## SL003869:GDF15                             MIC-1  Q99988         9518
## SL003320:FIGF                             VEGF-D  O43915         2277
## SL000052:TNNT2                        Troponin_T  P45379         7139
## SL001996:ANGPT2                   Angiopoietin-2  O15123          285
## SL007056:BMP10                             BMP10  O95393        27302
## SL000306:NPPB                             BNP-32  P16860         4879
## SL006610:ADAMTS13                          ATS13  Q76LX8        11093
## SL007206:THBS2                              TSP2  P35442         7058
## SL004646:LAYN                            Layilin  Q6UX15       143903
## SL004258:ADIPOQ                      Adiponectin  Q15848         9370
## SL005115:SPON1                         Spondin-1  Q9HCB6        10418
## SL009400:CHRDL1                            CRDL1  Q9BU40        91851
## SL011888:SMOC1                             SMOC1  Q9H4F8        64093
## SL004060:ECE1     Endothelin-converting_enzyme_1  P42892         1889
## SL003327:CFD                            Factor_D  P00746         1675
## SL008382:CST5                               CYTD  P28325         1473
## SL003324:F10               Coagulation_Factor_Xa  P00742         2159
## SL003770:SFRP1                            SARP-2  Q8N474         6422
## SL005230:UNC5C                            UNC5H3  O95185         8633
## SL005789:STC1                    Stanniocalcin-1  P52823         6781
## SL000592:TIMP2                            TIMP-2  P16035         7077
## SL007696:CD93                              C1QR1  Q9NPY3        22918
## SL001761:TNNI3                        Troponin_I  P19429         7137
## SL006924:AGRP                                ART  O00253          181
## SL005209:NOTCH3                          Notch-3  Q9UM47         4854
## SL007033:LTBP4                             LTBP4  Q8N2S1         8425
## SL010381:DKKL1                           Soggy-1  Q9UK85        27120
## SL002823:SELL                        sL-Selectin  P14151         6402
## SL006230:LUM                             Lumican  P51884         4060
## SL000462:IGFBP1                          IGFBP-1  P08833         3484
## SL002505:NPPA                                ANP  P01160         4878
## SL000087:IL6                                IL-6  P05231         3569
##                   EntrezGeneSymbol FlagA           Protein       logFC  AveExpr
## SL002785:NPPB                 NPPB  PASS     SL002785:NPPB  0.79681036 4.257362
## SL006119:TFF3                 TFF3  PASS     SL006119:TFF3  0.12252184 3.828077
## SL009324:FSTL3               FSTL3  PASS    SL009324:FSTL3  0.17666420 3.485938
## SL006527:EFEMP1             EFEMP1  PASS   SL006527:EFEMP1  0.12668089 2.823353
## SL003869:GDF15               GDF15  PASS    SL003869:GDF15  0.23869490 2.398806
## SL003320:FIGF                 FIGF  PASS     SL003320:FIGF  0.31157307 2.524268
## SL000052:TNNT2               TNNT2  PASS    SL000052:TNNT2  0.26394286 3.343274
## SL001996:ANGPT2             ANGPT2  PASS   SL001996:ANGPT2  0.29778799 2.116766
## SL007056:BMP10               BMP10  PASS    SL007056:BMP10  0.14511313 3.053178
## SL000306:NPPB                 NPPB  PASS     SL000306:NPPB  0.50848065 2.602311
## SL006610:ADAMTS13         ADAMTS13  PASS SL006610:ADAMTS13 -0.13719979 3.755110
## SL007206:THBS2               THBS2  PASS    SL007206:THBS2  0.23227681 3.701019
## SL004646:LAYN                 LAYN  PASS     SL004646:LAYN  0.18864544 2.935491
## SL004258:ADIPOQ             ADIPOQ  PASS   SL004258:ADIPOQ  0.31550991 3.274410
## SL005115:SPON1               SPON1  PASS    SL005115:SPON1  0.22874628 3.259998
## SL009400:CHRDL1             CHRDL1  PASS   SL009400:CHRDL1  0.10773789 3.249004
## SL011888:SMOC1               SMOC1  PASS    SL011888:SMOC1  0.16164263 3.401509
## SL004060:ECE1                 ECE1  PASS     SL004060:ECE1 -0.11222754 4.410959
## SL003327:CFD                   CFD  PASS      SL003327:CFD  0.13979577 2.900248
## SL008382:CST5                 CST5  PASS     SL008382:CST5  0.46872856 3.643594
## SL003324:F10                   F10  PASS      SL003324:F10 -0.12595674 3.790039
## SL003770:SFRP1               SFRP1  PASS    SL003770:SFRP1  0.33547700 3.162030
## SL005230:UNC5C               UNC5C  PASS    SL005230:UNC5C  0.10127373 3.400777
## SL005789:STC1                 STC1  PASS     SL005789:STC1  0.14287833 3.062556
## SL000592:TIMP2               TIMP2  PASS    SL000592:TIMP2  0.12737344 2.491127
## SL007696:CD93                 CD93  PASS     SL007696:CD93  0.25321915 3.988323
## SL001761:TNNI3               TNNI3  PASS    SL001761:TNNI3  0.43004405 2.892466
## SL006924:AGRP                 AGRP  PASS     SL006924:AGRP  0.19687481 3.404004
## SL005209:NOTCH3             NOTCH3  PASS   SL005209:NOTCH3  0.17949223 2.735815
## SL007033:LTBP4               LTBP4  PASS    SL007033:LTBP4  0.24517197 3.007896
## SL010381:DKKL1               DKKL1  PASS    SL010381:DKKL1 -0.08913164 3.519612
## SL002823:SELL                 SELL  PASS     SL002823:SELL -0.10073833 3.747984
## SL006230:LUM                   LUM  PASS      SL006230:LUM  0.13954849 2.987868
## SL000462:IGFBP1             IGFBP1  PASS   SL000462:IGFBP1  0.64242713 3.256480
## SL002505:NPPA                 NPPA  PASS     SL002505:NPPA  0.44974714 2.833070
## SL000087:IL6                   IL6  PASS      SL000087:IL6  0.34505690 2.361511
##                           t      P.Value    adj.P.Val Comparison
## SL002785:NPPB      3.255048 1.309033e-03 1.067680e-01 HF-Control
## SL006119:TFF3      2.104139 3.647995e-02 4.760633e-01 HF-Control
## SL009324:FSTL3     4.054383 6.935081e-05 1.131285e-02 HF-Control
## SL006527:EFEMP1    3.062156 2.465906e-03 1.532385e-01 HF-Control
## SL003869:GDF15     2.436193 1.562264e-02 3.376289e-01 HF-Control
## SL003320:FIGF      5.021685 1.044086e-06 4.137462e-04 HF-Control
## SL000052:TNNT2     2.954404 3.466729e-03 1.774969e-01 HF-Control
## SL001996:ANGPT2    4.393125 1.723218e-05 3.212570e-03 HF-Control
## SL007056:BMP10     2.505386 1.294226e-02 3.129725e-01 HF-Control
## SL000306:NPPB      3.601073 3.901509e-04 4.242891e-02 HF-Control
## SL006610:ADAMTS13 -2.521217 1.238986e-02 3.109378e-01 HF-Control
## SL007206:THBS2     2.397205 1.734025e-02 3.428641e-01 HF-Control
## SL004646:LAYN      2.440980 1.542246e-02 3.376289e-01 HF-Control
## SL004258:ADIPOQ    3.971442 9.625537e-05 1.395703e-02 HF-Control
## SL005115:SPON1     4.980009 1.268187e-06 4.137462e-04 HF-Control
## SL009400:CHRDL1    2.704883 7.357245e-03 2.671586e-01 HF-Control
## SL011888:SMOC1     2.470534 1.423635e-02 3.259375e-01 HF-Control
## SL004060:ECE1     -2.197500 2.900799e-02 4.301753e-01 HF-Control
## SL003327:CFD       2.153484 3.234718e-02 4.638799e-01 HF-Control
## SL008382:CST5      3.164242 1.770295e-03 1.274557e-01 HF-Control
## SL003324:F10      -1.994254 4.733506e-02 5.234936e-01 HF-Control
## SL003770:SFRP1     2.534467 1.194377e-02 3.109378e-01 HF-Control
## SL005230:UNC5C     2.013657 4.524007e-02 5.088846e-01 HF-Control
## SL005789:STC1      2.115893 3.545652e-02 4.760633e-01 HF-Control
## SL000592:TIMP2     2.613157 9.577879e-03 2.680377e-01 HF-Control
## SL007696:CD93      5.072210 8.235112e-07 4.137462e-04 HF-Control
## SL001761:TNNI3     2.848276 4.804179e-03 2.089818e-01 HF-Control
## SL006924:AGRP      2.035826 4.294332e-02 5.075789e-01 HF-Control
## SL005209:NOTCH3    2.497738 1.321691e-02 3.129725e-01 HF-Control
## SL007033:LTBP4     5.977009 8.858637e-09 1.156052e-05 HF-Control
## SL010381:DKKL1    -2.029754 4.356232e-02 5.075789e-01 HF-Control
## SL002823:SELL     -2.247749 2.556486e-02 4.044525e-01 HF-Control
## SL006230:LUM       3.153239 1.835411e-03 1.274557e-01 HF-Control
## SL000462:IGFBP1    3.016924 2.848259e-03 1.634302e-01 HF-Control
## SL002505:NPPA      3.449183 6.715464e-04 6.259772e-02 HF-Control
## SL000087:IL6       3.693566 2.778209e-04 3.295966e-02 HF-Control

3.6. Association with left ventricular ejection fraction

We took the full set of the protein data and we looked for candidates that are significantly correlated (Spearman correlation) with the left ventricular ejection fraction (LVEF). We found 147 proteins that are significantly associated with LVEF, 96 of which are also differentially expressed (Fisher test for enrichment P-value < 2.2e-16). The analysis also shows that 27 proteins LVEF correlated proteins with different variances.

# Spearman correlation of LVEF and protein expresison profiles 
cors <- LVEFcorr(Expr = data_CDCS$Expr,
                 Design = data_CDCS$Design,
                 Annotation = data_CDCS$Annotation)
head(cors)
##                    SomaId                        TargetFullName          Target
## SL000591:TIMP1   SL000591         Metalloproteinase_inhibitor_1          TIMP-1
## SL007696:CD93    SL007696     Complement_component_C1q_receptor           C1QR1
## SL000403:COL18A1 SL000403                            Endostatin      Endostatin
## SL005789:STC1    SL005789                       Stanniocalcin-1 Stanniocalcin-1
## SL004827:S100A6  SL004827                       Protein_S100-A6          S100A6
## SL007547:HAVCR2  SL007547 Hepatitis_A_virus_cellular_receptor_2           TIMD3
##                  UniProt EntrezGeneID EntrezGeneSymbol Flag       Spearman_rho
## SL000591:TIMP1    P01033         7076            TIMP1 FLAG -0.232591644241594
## SL007696:CD93     Q9NPY3        22918             CD93 PASS -0.231770111493184
## SL000403:COL18A1  P39060        80781          COL18A1 PASS -0.230708662101433
## SL005789:STC1     P52823         6781             STC1 PASS  -0.23037911581205
## SL004827:S100A6   P06703         6277           S100A6 PASS   0.22871792906718
## SL007547:HAVCR2   Q8TDQ0        84868           HAVCR2 PASS -0.227581824473883
##                               P.Value            adj.P.Val
## SL000591:TIMP1   4.47156451511159e-06 0.000201220403180022
## SL007696:CD93      4.842365537641e-06 0.000210642900887383
## SL000403:COL18A1 5.36503194296819e-06 0.000225844947925728
## SL005789:STC1    5.53796040890675e-06 0.000225844947925728
## SL004827:S100A6  6.49368561021418e-06 0.000256795749131197
## SL007547:HAVCR2  7.23571797494963e-06 0.000277723881097331
# store the data
# write.table(cors, "ProteomicsData/Cors_CDCS.txt", sep = "\t")

# number of proteins that are significantly correlated with LVEF
LVEFproteins_in_cdcs <- as.character(rownames(cors[which(as.numeric(as.character(cors$adj.P.Val)) <= 0.05 & cors$Flag == "PASS"), ]))
length(LVEFproteins_in_cdcs)
## [1] 147
# number of differentially expressed protein that are significantly correlated with LVEF 
length(intersect(LVEFproteins_in_cdcs, DEproteins_in_cdcs))
## [1] 96
# enrichmnt of differentially expressed proteins that are significantly correlated with LVEF
mat <- matrix(c(length(intersect(LVEFproteins_in_cdcs, DEproteins_in_cdcs)),
                length(intersect(LVEFproteins_in_cdcs, setdiff(rownames(DEs_CDCS), DEproteins_in_cdcs))),
                length(intersect(setdiff(rownames(cors), LVEFproteins_in_cdcs), DEproteins_in_cdcs)),
                length(intersect(setdiff(rownames(cors), LVEFproteins_in_cdcs), setdiff(rownames(DEs_CDCS), DEproteins_in_cdcs)))),
                nrow = 2)
colnames(mat) <- c("Corr_with_LVEF", "Uncorr_with_LVEF")
rownames(mat) <- c("DE", "non-DE")
mat
##        Corr_with_LVEF Uncorr_with_LVEF
## DE                 96              116
## non-DE             51             1042
fisher.test(mat)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  mat
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  11.24168 25.49497
## sample estimates:
## odds ratio 
##   16.84028
# number of proteins with significantly correlation with LVEF and different variances across the two groups
length(intersect(LVEFproteins_in_cdcs,DVproteins_in_cdcs))
## [1] 27

We merged the results of the differential expression and LVEF correlation analysis in order to plot the most important proteins of our study so far:

# merge the differential expression and LVEF corrlelation
sl1 <- sort.list(rownames(DEs_CDCS))
sl2 <- sort.list(rownames(cors))
merged <- cbind(DEs_CDCS[sl1,1:12], cors[sl2, 8:10])
colnames(merged)[colnames(merged) == "P.Value"] <- paste(c("DE_", "LVEF_"), "P.Value", sep = "")
colnames(merged)[colnames(merged) == "adj.P.Val"] <- paste(c("DE_", "LVEF_"), "adj.P.Val", sep = "")

# data to plot
data <- LVEFdata(Expr = data_CDCS$Expr, 
                 Design = data_CDCS$Design,
                 Annotation = data_CDCS$Annotation,
                 merged_table = merged)

# the heatmap
LVEFheat(data = data)

# the smoothed expression of the protein clusters
LVEFsmooth(data = data)
## Warning: Removed 2 rows containing non-finite values (stat_smooth).

The analysis identified 5 protein clusters with statistically significant correlation with LVEF (FDR < 5%). PGD protein is the most highly LVEF-correlated (Spearman’s rho = 0.276 and FDR = 6.67E-6). On the other hand, NT-proBNP is the most highly anti-correlated protein of the CDCS dataset (Spearman’s rho = -0.451 and FDR = 2.15E-17).

4. WGCNA analysis

We utilised the Weighted Gene Co-expression Network Analysis (WGCNA) pipeline to study the HF AND Control specific co-regulation networks of our proteins in CDCS. First we identify the power parameter that approximates scale-free topology for each network:

# first step of WGCNA analysis
wgcnaData <- WGCNAanalysis(Expr = data_CDCS$Expr,
                            Design = data_CDCS$Design,
                            Annotation = data_CDCS$Annotation,
                            DEs = DEs_CDCS,
                            de_cut = 0.05,
                            ppHF = 7,
                            ppC = 7)
## pickSoftThreshold: will use block size 1128.
##  pickSoftThreshold: calculating connectivity for given powers...
##    ..working on genes 1 through 1128 of 1128
## Warning: executing %dopar% sequentially: no parallel backend registered
##    Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
## 1      1    0.713 -1.160          0.883 154.000  1.38e+02 331.00
## 2      2    0.969 -1.280          0.961  41.000  2.67e+01 162.00
## 3      3    0.972 -1.200          0.978  16.500  6.72e+00 101.00
## 4      4    0.950 -1.120          0.972   8.680  2.11e+00  72.10
## 5      5    0.947 -1.060          0.972   5.390  7.84e-01  56.10
## 6      6    0.932 -1.020          0.957   3.700  3.21e-01  45.50
## 7      7    0.943 -0.993          0.967   2.710  1.35e-01  38.10
## 8      8    0.949 -0.972          0.964   2.080  6.19e-02  32.80
## 9      9    0.945 -0.973          0.948   1.640  2.88e-02  28.70
## 10    10    0.935 -0.985          0.925   1.320  1.44e-02  25.40
## 11    12    0.939 -0.981          0.926   0.904  3.75e-03  20.30
## 12    14    0.942 -1.020          0.930   0.649  9.57e-04  16.60
## 13    16    0.931 -1.050          0.913   0.482  2.68e-04  13.80
## 14    18    0.943 -1.050          0.928   0.368  7.86e-05  11.60
## 15    20    0.951 -1.060          0.938   0.287  2.30e-05   9.94
## pickSoftThreshold: will use block size 1128.
##  pickSoftThreshold: calculating connectivity for given powers...
##    ..working on genes 1 through 1128 of 1128
##    Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
## 1      1    0.628 -1.070          0.833 150.000  1.36e+02  328.0
## 2      2    0.920 -1.260          0.900  40.300  2.80e+01  162.0
## 3      3    0.894 -1.230          0.888  16.300  7.32e+00  104.0
## 4      4    0.895 -1.160          0.930   8.600  2.29e+00   75.5
## 5      5    0.892 -1.080          0.961   5.350  8.38e-01   59.1
## 6      6    0.895 -1.020          0.958   3.700  3.52e-01   48.7
## 7      7    0.895 -0.990          0.938   2.740  1.56e-01   41.5
## 8      8    0.908 -0.952          0.950   2.130  7.02e-02   36.0
## 9      9    0.929 -0.934          0.970   1.700  3.39e-02   31.6
## 10    10    0.924 -0.931          0.938   1.400  1.63e-02   28.1
## 11    12    0.945 -0.910          0.951   0.987  3.75e-03   22.7
## 12    14    0.921 -0.941          0.907   0.728  9.66e-04   18.8
## 13    16    0.959 -0.935          0.950   0.554  2.76e-04   15.8
## 14    18    0.935 -0.944          0.918   0.431  7.57e-05   13.5
## 15    20    0.924 -0.971          0.902   0.342  2.10e-05   11.6

The pipeline estimated the optimal power parameter = 7. Using it, we can start building the two co-regulation networks and their associated modules:

# generate the proetin modules
tf <- scan("ProteomicsData/TFs.txt", what = "character")
ef <- scan("ProteomicsData/Epigenetic_factors.txt", what = "characdter")
wgcnaModules <- WGCNAmodules(wgcnaData = wgcnaData, TFs = tf, EFs = ef)
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.

## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.

##  ..done.

##  ..done.

##  mergeCloseModules: Merging modules whose distance is less than 0.2
##    multiSetMEs: Calculating module MEs.
##      Working on set 1 ...
##      moduleEigengenes: Calculating 3 module eigengenes in given set.
##    Calculating new MEs...
##    multiSetMEs: Calculating module MEs.
##      Working on set 1 ...
##      moduleEigengenes: Calculating 3 module eigengenes in given set.

##  mergeCloseModules: Merging modules whose distance is less than 0.2
##    multiSetMEs: Calculating module MEs.
##      Working on set 1 ...
##      moduleEigengenes: Calculating 2 module eigengenes in given set.
##    Calculating new MEs...
##    multiSetMEs: Calculating module MEs.
##      Working on set 1 ...
##      moduleEigengenes: Calculating 2 module eigengenes in given set.

# HF protein modules
head(wgcnaModules$module_data_HF)
##                   SomaId                            TargetFullName
## SL002508:IL18BP SL002508            Interleukin-18-binding_protein
## SL004016:CXCL16 SL004016                  C-X-C_motif_chemokine_16
## SL009412:DKK3   SL009412                Dickkopf-related_protein_3
## SL008810:NEGR1  SL008810               Neuronal_growth_regulator_1
## SL005196:LSAMP  SL005196 Limbic_system-associated_membrane_protein
## SL009349:FSTL1  SL009349             Follistatin-related_protein_1
##                          Target UniProt EntrezGeneID EntrezGeneSymbol Flag isTF
## SL002508:IL18BP       IL-18_BPa  O95998        10068           IL18BP PASS    0
## SL004016:CXCL16 CXCL16,_soluble  Q9H2A7        58191           CXCL16 PASS    0
## SL009412:DKK3              DKK3  Q9UBP4        27122             DKK3 PASS    0
## SL008810:NEGR1            NEGR1  Q7Z3B1       257194            NEGR1 PASS    0
## SL005196:LSAMP            LSAMP  Q13449         4045            LSAMP PASS    0
## SL009349:FSTL1            FSTL1  Q12841        11167            FSTL1 PASS    0
##                 is.Epig           Module         Cor       CorSig
## SL002508:IL18BP       0 #581845_cor.Gene -0.53032037 1.617009e-14
## SL004016:CXCL16       0 #581845_cor.Gene -0.52258757 4.494488e-14
## SL009412:DKK3         0 #581845_cor.Gene -0.50958958 2.366226e-13
## SL008810:NEGR1        0 #581845_cor.Gene -0.47893843 9.099319e-12
## SL005196:LSAMP        0 #581845_cor.Gene -0.47619299 1.240037e-11
## SL009349:FSTL1        0 #581845_cor.Gene -0.46504952 4.235966e-11
##                       Connectivity TargetFullName.1 EntrezGeneID.1   ModuleName
## SL002508:IL18BP 0.0824199544642018              ---            --- ModAnnot_HF1
## SL004016:CXCL16  0.057539649687923              ---            --- ModAnnot_HF1
## SL009412:DKK3   0.0464891842044999              ---            --- ModAnnot_HF1
## SL008810:NEGR1    0.07045264757201              ---            --- ModAnnot_HF1
## SL005196:LSAMP  0.0521577814756525              ---            --- ModAnnot_HF1
## SL009349:FSTL1  0.0572776396950279              ---            --- ModAnnot_HF1
# Control protein modules
head(wgcnaModules$module_data_Control)
##                    SomaId                                      TargetFullName
## SL008486:LGALS9  SL008486                                          Galectin-9
## SL012517:RSPO4   SL012517                                         R-spondin-4
## SL004248:GDNF    SL004248         Glial_cell_line-derived_neurotrophic_factor
## SL011448:TNFRSF4 SL011448 Tumor_necrosis_factor_receptor_superfamily_member_4
## SL001802:IFNG    SL001802                                    Interferon_gamma
## SL002517:TNF     SL002517                               Tumor_necrosis_factor
##                  Target UniProt EntrezGeneID EntrezGeneSymbol Flag isTF is.Epig
## SL008486:LGALS9    LEG9  O00182         3965           LGALS9 PASS    0       0
## SL012517:RSPO4    RSPO4  Q2I0M5       343637            RSPO4 PASS    0       0
## SL004248:GDNF      GDNF  P39905         2668             GDNF PASS    0       0
## SL011448:TNFRSF4   TNR4  P43489         7293          TNFRSF4 PASS    0       0
## SL001802:IFNG     IFN-g  P01579         3458             IFNG PASS    0       0
## SL002517:TNF      TNF-a  P01375         7124              TNF PASS    0       0
##                            Module           Cor       CorSig       Connectivity
## SL008486:LGALS9  #A2E4ED_cor.Gene  0.7872894322 5.309984e-54 0.0482001294030122
## SL012517:RSPO4   #A2E4ED_cor.Gene  0.7589565065 4.113900e-48 0.0459888884934919
## SL004248:GDNF    #A2E4ED_cor.Gene  0.7446536119 1.936266e-45  0.101232304453775
## SL011448:TNFRSF4 #A2E4ED_cor.Gene  0.7370502018 4.328097e-44 0.0517663046295784
## SL001802:IFNG    #A2E4ED_cor.Gene  0.7212033993 2.011265e-41  0.037901995788316
## SL002517:TNF     #A2E4ED_cor.Gene  0.7158598012 1.450797e-40 0.0588539951482277
##                  TargetFullName.1 EntrezGeneID.1        ModuleName
## SL008486:LGALS9               ---            --- ModAnnot_Control1
## SL012517:RSPO4                ---            --- ModAnnot_Control1
## SL004248:GDNF                 ---            --- ModAnnot_Control1
## SL011448:TNFRSF4              ---            --- ModAnnot_Control1
## SL001802:IFNG                 ---            --- ModAnnot_Control1
## SL002517:TNF                  ---            --- ModAnnot_Control1

We identified 3 HF and 2 Control modules:

# HF modules
table(wgcnaModules$module_data_HF$ModuleName)
## 
## ModAnnot_HF1 ModAnnot_HF2 ModAnnot_HF3 
##          153          313          662
# Control modules
table(wgcnaModules$module_data_Control$ModuleName)
## 
## ModAnnot_Control1 ModAnnot_Control2 
##               824               304
# merge and save
mm <- match(rownames(wgcnaModules$module_data_HF), rownames(wgcnaModules$module_data_Control))
wgcnaModules$module_data_Control <- wgcnaModules$module_data_Control[mm, ]
merged_modules <- cbind(wgcnaModules$module_data_HF[,1:9],
                        Corr_HF = wgcnaModules$module_data_HF[,11], Corr_Control = wgcnaModules$module_data_Control[,11],
                        CorrSig_HF = wgcnaModules$module_data_HF[,12], CorrSig_Control = wgcnaModules$module_data_Control[,12],
                        Connectivity_HF = wgcnaModules$module_data_HF[,13], Connectivity_Control = wgcnaModules$module_data_Control[,13],
                        ModuleName_HF = wgcnaModules$module_data_HF[,16], ModuleName_Control = wgcnaModules$module_data_Control[,16])
head(merged_modules)
##                   SomaId                            TargetFullName
## SL002508:IL18BP SL002508            Interleukin-18-binding_protein
## SL004016:CXCL16 SL004016                  C-X-C_motif_chemokine_16
## SL009412:DKK3   SL009412                Dickkopf-related_protein_3
## SL008810:NEGR1  SL008810               Neuronal_growth_regulator_1
## SL005196:LSAMP  SL005196 Limbic_system-associated_membrane_protein
## SL009349:FSTL1  SL009349             Follistatin-related_protein_1
##                          Target UniProt EntrezGeneID EntrezGeneSymbol Flag isTF
## SL002508:IL18BP       IL-18_BPa  O95998        10068           IL18BP PASS    0
## SL004016:CXCL16 CXCL16,_soluble  Q9H2A7        58191           CXCL16 PASS    0
## SL009412:DKK3              DKK3  Q9UBP4        27122             DKK3 PASS    0
## SL008810:NEGR1            NEGR1  Q7Z3B1       257194            NEGR1 PASS    0
## SL005196:LSAMP            LSAMP  Q13449         4045            LSAMP PASS    0
## SL009349:FSTL1            FSTL1  Q12841        11167            FSTL1 PASS    0
##                 is.Epig     Corr_HF  Corr_Control   CorrSig_HF CorrSig_Control
## SL002508:IL18BP       0 -0.53032037  0.4307792680 1.617009e-14    1.019809e-12
## SL004016:CXCL16       0 -0.52258757  0.2245546505 4.494488e-14    3.455239e-04
## SL009412:DKK3         0 -0.50958958  0.2195649562 2.366226e-13    4.705256e-04
## SL008810:NEGR1        0 -0.47893843  0.3431590868 9.099319e-12    2.567341e-08
## SL005196:LSAMP        0 -0.47619299  0.2652653578 1.240037e-11    2.141807e-05
## SL009349:FSTL1        0 -0.46504952  0.2911335354 4.235966e-11    2.841307e-06
##                    Connectivity_HF Connectivity_Control ModuleName_HF
## SL002508:IL18BP 0.0824199544642018   0.0490480657575382  ModAnnot_HF1
## SL004016:CXCL16  0.057539649687923   0.0294019307130613  ModAnnot_HF1
## SL009412:DKK3   0.0464891842044999   0.0330780010007111  ModAnnot_HF1
## SL008810:NEGR1    0.07045264757201   0.0551180917312804  ModAnnot_HF1
## SL005196:LSAMP  0.0521577814756525   0.0416699951520157  ModAnnot_HF1
## SL009349:FSTL1  0.0572776396950279    0.044316466702459  ModAnnot_HF1
##                 ModuleName_Control
## SL002508:IL18BP  ModAnnot_Control1
## SL004016:CXCL16  ModAnnot_Control1
## SL009412:DKK3    ModAnnot_Control1
## SL008810:NEGR1   ModAnnot_Control1
## SL005196:LSAMP   ModAnnot_Control1
## SL009349:FSTL1   ModAnnot_Control1
# write.table(merged_modules, "ProteomicsData/Modules.txt", sep = "\t")

Further analysis showed that module Control 1 is split into HF1 (149 proteins; ~18.1%) and HF3 (646 proteins; ~78.4%). Module Control 2 is very similar to HF2 (284 proteins; ~93.4%):

# the proteins of each module 
c1 <- rownames(wgcnaModules$module_data_Control)[wgcnaModules$module_data_Control$ModuleName == "ModAnnot_Control1"]
c2 <- rownames(wgcnaModules$module_data_Control)[wgcnaModules$module_data_Control$ModuleName == "ModAnnot_Control2"]
hf1 <- rownames(wgcnaModules$module_data_HF)[wgcnaModules$module_data_HF$ModuleName == "ModAnnot_HF1"]
hf2 <- rownames(wgcnaModules$module_data_HF)[wgcnaModules$module_data_HF$ModuleName == "ModAnnot_HF2"]
hf3 <- rownames(wgcnaModules$module_data_HF)[wgcnaModules$module_data_HF$ModuleName == "ModAnnot_HF3"]

# intersection of Control 1 with all HF modules
length(intersect(c1, hf1)) / length(c1)
## [1] 0.1808252
length(intersect(c1, hf2)) / length(c1)
## [1] 0.03519417
length(intersect(c1, hf3)) / length(c1)
## [1] 0.7839806
# intersection of Control 2 with all HF modules
length(intersect(c2, hf1)) / length(c2)
## [1] 0.01315789
length(intersect(c2, hf2)) / length(c2)
## [1] 0.9342105
length(intersect(c2, hf3)) / length(c2)
## [1] 0.05263158

4.1. Differential co-expression networks

We applied the WGCNA pipeline of Fuller et al. Weighted gene coexpression network analysis strategies applied to mouse weight. Mamm Genome. 2007;18:463-72 to identify protein groups with statistically significant network connectivity and differential expression between HF and Control. Our strategy is as follows:

  • Record the HF and Controlgroup connectivities, \(K_{HF}\) and \(K_{Control}\), the connectivity difference \(DiffK = K_{HF} - K_{Control}\) and the limma t-test statistics and FDRs of the 1128 QC-passed proteins of CDCS.

  • Permute the HF and Control labels in order to generate new random patient groups.

  • Run the WGCNA analysis to estimate the permuted group connectivities \(K^*_{HF}\) and \(K^*_{Control}\) and calculate the connectivity difference \(DiffK^* = K^*_{HF} - K^*_{Control}\).

  • Iterate B = 50,000 times.

  • Evaluate the connectivity \(DiffK\) significance by the resampling p-value: \(P = #(DiffK^* > DiffK) / B\).

# collect the diffk = difference in connectivity (HF - Control)
diffk <- cbind(colnames(wgcnaData$wgcnaData_HF$Data), wgcnaData$DiffK)

# calculate the permutation statistics B = 50,000 times
stats_star <- wgcnaStatsFromPermutation(wgcnaData = wgcnaData, B = 50000)

# evaluate the differential co-expression networks
dina <- summariseDiNA(wgcnaData = wgcnaData,
                      wgcnaModules = wgcnaModules,
                      DiffK = diffk,
                      DEs = DEs_CDCS,
                      de.fdr.cut = 0.05,
                      diffk.cut = 0.1,
                      diffk.fdr.cut = 0.05,
                      permStats = stats_star[[1]])

# store the results
# write.table(dina, "ProteomicsData/DiNA.txt", sep = "\t")

This approach extracts the differentially expressed proteins and the proteins that show differential connectivity between HF and Control at FDR = 5%. The significant proteins are the following:

# load the pre-calculated data (several hours of computing)
dina <- read.table("ProteomicsData/DiNA.txt", sep = "\t", comment.char = "!")

# the significant proteins
dina_signf <- dina[dina$DiNA_significance == "Y", ]
dina_signf
##                     SomaId
## SL004557:CD59     SL004557
## SL008631:DSC2     SL008631
## SL001777:CST3     SL001777
## SL004140:EFNA4    SL004140
## SL004141:EFNA5    SL004141
## SL001992:TNFRSF1A SL001992
## SL005172:IGFBP6   SL005172
## SL005193:JAM2     SL005193
## SL005194:JAM3     SL005194
## SL001800:TNFRSF1B SL001800
## SL004151:IL15RA   SL004151
## SL000283:B2M      SL000283
## SL006119:TFF3     SL006119
## SL002654:EPHA2    SL002654
## SL005213:RELT     SL005213
##                                                          TargetFullName
## SL004557:CD59                                         CD59_glycoprotein
## SL008631:DSC2                                             Desmocollin-2
## SL001777:CST3                                                Cystatin-C
## SL004140:EFNA4                                                Ephrin-A4
## SL004141:EFNA5                                                Ephrin-A5
## SL001992:TNFRSF1A  Tumor_necrosis_factor_receptor_superfamily_member_1A
## SL005172:IGFBP6            Insulin-like_growth_factor-binding_protein_6
## SL005193:JAM2                            Junctional_adhesion_molecule_B
## SL005194:JAM3                            Junctional_adhesion_molecule_C
## SL001800:TNFRSF1B  Tumor_necrosis_factor_receptor_superfamily_member_1B
## SL004151:IL15RA                   Interleukin-15_receptor_subunit_alpha
## SL000283:B2M                                       Beta-2-microglobulin
## SL006119:TFF3                                          Trefoil_factor_3
## SL002654:EPHA2                                 Ephrin_type-A_receptor_2
## SL005213:RELT     Tumor_necrosis_factor_receptor_superfamily_member_19L
##                                   Target UniProt EntrezGeneID EntrezGeneSymbol
## SL004557:CD59                       CD59  P13987          966             CD59
## SL008631:DSC2                       DSC2  Q02487         1824             DSC2
## SL001777:CST3                 Cystatin_C  P01034         1471             CST3
## SL004140:EFNA4                 Ephrin-A4  P52798         1945            EFNA4
## SL004141:EFNA5                 Ephrin-A5  P52803         1946            EFNA5
## SL001992:TNFRSF1A               TNF_sR-I  P19438         7132         TNFRSF1A
## SL005172:IGFBP6                  IGFBP-6  P24592         3489           IGFBP6
## SL005193:JAM2                      JAM-B  P57087        58494             JAM2
## SL005194:JAM3                      JAM-C  Q9BX67        83700             JAM3
## SL001800:TNFRSF1B              TNF_sR-II  P20333         7133         TNFRSF1B
## SL004151:IL15RA                 IL-15_Ra  Q13261         3601           IL15RA
## SL000283:B2M            b2-Microglobulin  P61769          567              B2M
## SL006119:TFF3                       TFF3  Q07654         7033             TFF3
## SL002654:EPHA2    Epithelial_cell_kinase  P29317         1969            EPHA2
## SL005213:RELT                       RELT  Q969Z4        84957             RELT
##                   Flag isTF is.Epig         Cor       CorSig   ModuleName
## SL004557:CD59     PASS    0       0 -0.37699986 1.684187e-07 ModAnnot_HF1
## SL008631:DSC2     PASS    0       0 -0.31974845 1.145474e-05 ModAnnot_HF1
## SL001777:CST3     PASS    0       0 -0.12333575 9.809936e-02 ModAnnot_HF1
## SL004140:EFNA4    PASS    0       0 -0.36561935 4.167091e-07 ModAnnot_HF1
## SL004141:EFNA5    PASS    0       0 -0.33355510 4.463717e-06 ModAnnot_HF1
## SL001992:TNFRSF1A PASS    0       0 -0.24723616 7.924368e-04 ModAnnot_HF1
## SL005172:IGFBP6   PASS    0       0 -0.14328861 5.431233e-02 ModAnnot_HF1
## SL005193:JAM2     PASS    0       0 -0.17824612 1.636377e-02 ModAnnot_HF1
## SL005194:JAM3     PASS    0       0  0.08706781 2.438260e-01 ModAnnot_HF2
## SL001800:TNFRSF1B PASS    0       0 -0.30113489 3.799977e-05 ModAnnot_HF1
## SL004151:IL15RA   PASS    0       0 -0.32955902 5.891338e-06 ModAnnot_HF1
## SL000283:B2M      PASS    0       0 -0.11780828 1.142277e-01 ModAnnot_HF1
## SL006119:TFF3     PASS    0       0 -0.30672398 2.673419e-05 ModAnnot_HF1
## SL002654:EPHA2    PASS    0       0 -0.33424636 4.252822e-06 ModAnnot_HF1
## SL005213:RELT     PASS    0       0 -0.35547883 9.075413e-07 ModAnnot_HF1
##                       logFC     Ttest        DE_PV       DE_FDR      DiffK
## SL004557:CD59     0.1351539 0.1351539 6.431785e-05 1.199069e-03  0.1739577
## SL008631:DSC2     0.2210540 0.2210540 1.178876e-07 5.917050e-06  0.1878401
## SL001777:CST3     0.2552607 0.2552607 1.443106e-11 9.416268e-09  0.1541372
## SL004140:EFNA4    0.2116332 0.2116332 2.088071e-08 1.362466e-06  0.1959876
## SL004141:EFNA5    0.2069690 0.2069690 6.266176e-07 2.336389e-05  0.1453081
## SL001992:TNFRSF1A 0.2325912 0.2325912 1.802758e-07 8.112410e-06  0.1989524
## SL005172:IGFBP6   0.1332245 0.1332245 2.818557e-04 3.678217e-03  0.1192140
## SL005193:JAM2     0.1420750 0.1420750 8.817774e-07 3.110053e-05  0.1522122
## SL005194:JAM3     0.1514172 0.1514172 5.155150e-04 5.701246e-03 -0.3265702
## SL001800:TNFRSF1B 0.1388130 0.1388130 6.963250e-04 6.781374e-03  0.1440474
## SL004151:IL15RA   0.1743670 0.1743670 7.361769e-05 1.334321e-03  0.1196379
## SL000283:B2M      0.2929166 0.2929166 3.352110e-09 4.910032e-07  0.1544569
## SL006119:TFF3     0.3352670 0.3352670 4.552645e-11 1.980400e-08  0.1637671
## SL002654:EPHA2    0.2535444 0.2535444 1.733899e-08 1.257077e-06  0.1724334
## SL005213:RELT     0.2299626 0.2299626 2.279386e-07 9.595480e-06  0.2201945
##                   DiffK_PV  DiffK_FDR Comparison DiNA_significance
## SL004557:CD59      0.00014 0.01611429 HF-Control                 Y
## SL008631:DSC2      0.00018 0.01611429 HF-Control                 Y
## SL001777:CST3      0.00038 0.01863652 HF-Control                 Y
## SL004140:EFNA4     0.00056 0.02423111 HF-Control                 Y
## SL004141:EFNA5     0.00032 0.01863652 HF-Control                 Y
## SL001992:TNFRSF1A  0.00020 0.01611429 HF-Control                 Y
## SL005172:IGFBP6    0.00066 0.02567172 HF-Control                 Y
## SL005193:JAM2      0.00000 0.00000000 HF-Control                 Y
## SL005194:JAM3      0.00120 0.03760000 HF-Control                 Y
## SL001800:TNFRSF1B  0.00016 0.01611429 HF-Control                 Y
## SL004151:IL15RA    0.00038 0.01863652 HF-Control                 Y
## SL000283:B2M       0.00026 0.01833000 HF-Control                 Y
## SL006119:TFF3      0.00016 0.01611429 HF-Control                 Y
## SL002654:EPHA2     0.00010 0.01611429 HF-Control                 Y
## SL005213:RELT      0.00002 0.01128000 HF-Control                 Y
# number of significant proteins by module
tab <- table(dina$ModuleName, dina$DiNA_significance)
tab
##               
##                  N   Y
##   ModAnnot_HF1 139  14
##   ModAnnot_HF2 312   1
##   ModAnnot_HF3 662   0
# enrichment of HF1
mat <- cbind(tab[1, ], apply(tab[2:3, ], 2, sum))
fisher.test(mat)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  mat
## p-value = 5.554e-12
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.0002469349 0.0684075100
## sample estimates:
## odds ratio 
## 0.01026009

The analysis highlighted 15 proteins that are both differentially expressed and differentially connected (DiNA). All but one of them come from the HF1 module. The HF1 module is statistically over-represented with significant proteins (Fisher test P = 5.554e-12). The plot below shows these proteins in blue across the logFC and DiffK axes:

plotDiNA(DiNAdata = dina)  
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

5. Integration with single-cell RNA-seq (Transcriptomics)

In this paragraph, we will briefly show the integration of the proteomic and the single-cell RNA-seq results. We will use the processed data of 3 single-cell and 1 single-nuclei RNA-seq experiments. The processing has been described in our paper. The raw data can be downloaded from the respective publications.

First, we load the normalised expression profiles of the 440 Sham and 390 MI high-quality cardiac fibroblasts from 3 Sham and 3 MI mice (nonMyo_expression), the design of the experiment (nonMyo_design) and the differentially expressed genes of the Sham vs MI comparison estimated by Seurat (nonMyo_de). We also consider the differentially expressed proteins and the LVEF correlations of the CDCS data analysis.

# load the RNA-seq data
load("nonMyo.RData")

# contents of the .Rda file
names(nonMyo)
## [1] "nonMyo_expression" "nonMyo_design"     "nonMyo_de"
# stored differentially expressed proteins
p_DE <- read.table("ProteomicsData/cands_CDCSonly.txt", sep = "\t", header = T)

# the correlations are read from the 'cors' variable above

The function below intersects the differentially expressed proteins and all genes from the non-myocyte sincle-cell RNA-seq experiment and plots the latter expression profiles in a heatmap. Gene names in green are differentially expressed in both datasets (FDR<10% and |logFC| >1 in the single-cell experiment), gene names in khaki are differentially expressed in the single-cell dataset only with the same logFC signs in the two datasets and gene names in gray have different logFC signs in the two datasets independently of differential expression. The right-hand side plots the somalogic logFCs (HF - Control) highlighting in cyan the differentially expressed proteins at FDR < 0.1%.

# Integration of plasma proteomics and sc fibroblasts
nonmyo <- scrna_nonmyocytes(scExpr = nonMyo$nonMyo_expression,
                            scDesign = nonMyo$nonMyo_design,
                            scDE = nonMyo$nonMyo_de,
                            pDE = p_DE,
                            lvefStats = cors)

In the same fashion, we analyse the other 3 available datasets. We load and integrate the TPM normalised expression profiles of mouse cardiomyocytes from 85 Sham and 100 TAC high-quality single nuclei (snCardiomyocytes_expression), the design of the experiment (snCardiomyocytes_design) and the differentially expressed genes of Sham vs TAC by edgeR (snCardiomyocytes_de).

# load the RNA-seq data
load("snMyo.RData")

# contents of the .Rda file
names(snMyo)
## [1] "snCardiomyocytes_expression" "snCardiomyocytes_design"    
## [3] "snCardiomyocytes_de"
# Integration of plasma proteomics and sn myocytes
myo1 <- scrna_myocytes(scExpr = snMyo$snCardiomyocytes_expression,
                       scDesign = snMyo$snCardiomyocytes_design,
                       scDE = snMyo$snCardiomyocytes_de,
                       pDE = p_DE,
                       lvefStats = cors)

We load and integrate the TPM normalised expression profiles the time course mouse myocytes dataset of 88 Sham, 69 day-3 TAC, 83 day-7 TAC and 73 4-weeks TAC. As differentially expressed, we consider those genes with FDR < 10% and |logFC| > 1 in the Sham vs TAC of at least one time point.

# load the RNA-seq data
load("tsMyo.RData")

# contents of the .Rda file
names(tsMyo)
## [1] "tsCardiomyocytes_expression" "tsCardiomyocytes_design"    
## [3] "tsCardiomyocytes_de"
# Integration of plasma proteomics and sc myocytes of the time-series experiment 
myo2 <- scrna_myocytes_time(scExpr = tsMyo$tsCardiomyocytes_expression,
                       scDesign = tsMyo$tsCardiomyocytes_design,
                       scDE = tsMyo$tsCardiomyocytes_de,
                       pDE = p_DE,
                       lvefStats = cors)

Finally, we load and integrate the human myocytes dataset of 71 Control and 488 DCM single cells:

# load the RNA-seq data
load("hMyo.RData")

# contents of the .Rda file
names(hMyo)
## [1] "hCardiomyocytes_expression" "hCardiomyocytes_design"    
## [3] "hCardiomyocytes_de"
# Integration of plasma proteomics and sc human myocytes
hmyo <- scrna_humanMyocytes(scExpr = hMyo$hCardiomyocytes_expression,
                            scDesign = hMyo$hCardiomyocytes_design,
                            scDE = hMyo$hCardiomyocytes_de,
                            pDE = p_DE,
                            lvefStats = cors)

6. Summaries of data integration

In this paragraph we will summarise the main findings of this study. We analysed the data of two protein cohorts and 4 single-cell RNA-seq datasets and integrated the findings in order to prioritise a list of proteins for further experimental validations. In CDCS we found 1128 proteins that passed the QC criteria:

# QC in CDCS
table(data_CDCS$Annotation$Flag)
## 
## FLAG PASS 
##  177 1128

The differential expression analysis identified 212 significant proteins among them:

# number of differentially expressed proteins in CDCS
length(DEproteins_in_cdcs)
## [1] 212

We intersected these results with those of LVEF correlation analysis to find 96 differentially expressed and LVEF significantly correlated proteins:

# number of differentially expressed and LVEF significantly correlated proteins in CDCS
leg1_lvef <- intersect(LVEFproteins_in_cdcs, DEproteins_in_cdcs)
length(leg1_lvef)
## [1] 96

Finally, we run WGCNA analysis to find a robust set of significant proteins that exhibit all the above criteria and have unique co-regulation patterns in HF compared to Control. We ended up with 12 candidates:

# number of differentially expressed, LVEF significantly correlated & HF uniquely co-regulated proteins in CDCS  
leg1_DiNA <- intersect(intersect(LVEFproteins_in_cdcs, DEproteins_in_cdcs), rownames(dina_signf)[dina_signf$ModuleName == "ModAnnot_HF1"])
length(leg1_DiNA)
## [1] 12

We integrated the above results with those of the differential expression analysis of four independent single-cell RNA-seq datasets. In the mouse cardiac non-myocytes dataset we found:

We considered as consisten the green and khaki candidates that passed the CDCS QC. We grouped together the consistent results and found that they are enriched among the 467 differentially expressed genes of the non-myocytes study (Fisher test P = 1.253e-3):

# agreement between CDCS and non-myocyte scRNA-seq differential expression 
table(nonmyo[[1]][, 4], nonmyo[[1]][, 5])
##        
##         FLAG PASS
##   gray     1    7
##   green    4    6
##   khaki    0   28
# frequency table and enrichment of consistent findings
nonmyo[[3]]
##            Consistent Inconsistent
## Common             34           12
## Non-common        433          455
fisher.test(nonmyo[[3]])
## 
##  Fisher's Exact Test for Count Data
## 
## data:  nonmyo[[3]]
## p-value = 0.001253
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.478916 6.393639
## sample estimates:
## odds ratio 
##   2.974047

In the same way, we integrated the results of CDCS and the other single-cell datasets. In the mouse cardiomyocytes of TAC vs Control we found that 13 genes of the Green set passed CDCS’s QC and a highly significant enrichment of consistent hits among the 713 differentially expressed genes (Fisher test P = 6.807e-09).

# agreement between CDCS and myocyte scRNA-seq differential expression 
table(myo1[[1]][, 4], myo1[[1]][, 5])
##        
##         FLAG PASS
##   gray     1    3
##   green    3   13
##   khaki    4   25
# frequency table and enrichment of consistent findings
myo1[[3]]
##            Consistent Inconsistent
## Common             38            3
## Non-common        675          710
fisher.test(myo1[[3]])
## 
##  Fisher's Exact Test for Count Data
## 
## data:  myo1[[3]]
## p-value = 6.807e-09
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   4.191496 67.579692
## sample estimates:
## odds ratio 
##   13.30144

In the mouse cardiomyocytes TAC vs Control time-course experiment we found 10 Green set genes that passed CDCS’s QC and a highly significant enrichment of consistent hits among the 3022 differentially expressed genes of any of the pairwise comparisons (Fisher test P = 1.809e-10). In addition, we found that such a highly significant enrichment is reproduced among the LVEF associated proteins (Fisher test P = 1.42e-5).

# agreement between CDCS and myocyte scRNA-seq differential expression 
table(myo2[[1]][, 4], myo2[[1]][, 5])
##        
##         FLAG PASS
##   gray     1    5
##   green    4   10
##   khaki    5   40
# frequency table and enrichment of consistent findings
myo2[[3]]
##            Consistent Inconsistent
## Common             50            5
## Non-common       2972         3017
fisher.test(myo2[[3]])
## 
##  Fisher's Exact Test for Count Data
## 
## data:  myo2[[3]]
## p-value = 1.809e-10
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   4.062894 32.665758
## sample estimates:
## odds ratio 
##   10.14807

Finally, in the human cardiomyocytes DCM vs Control study we found 11 Green set genes that passed CDCS’s QC and a highly significant enrichment of consistent hits among the 390 differentially expressed genes of the study (Fisher test P = 3.487e-08):

# agreement between CDCS and myocyte scRNA-seq differential expression 
table(hmyo[[1]][, 4], hmyo[[1]][, 5])
##        
##         FLAG PASS
##   gray     2    5
##   green    0   11
##   khaki    4   29
# frequency table and enrichment of consistent findings
hmyo[[3]]
##            Consistent Inconsistent
## Common             40            5
## Non-common        350          385
fisher.test(hmyo[[3]])
## 
##  Fisher's Exact Test for Count Data
## 
## data:  hmyo[[3]]
## p-value = 3.487e-08
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   3.411695 28.838281
## sample estimates:
## odds ratio 
##   8.780463

We collected the results of all-single-cell analysis and found 128 unique candidates (127 if we count NPPB only once with respect to gene names) of which 24 (23 if we count NPPB only once with respect to gene names) were significantly correlated with LVEF.

# green/khaki QC-passed genes found in at least one dataset 
set1 <- as.character(nonmyo[[1]][nonmyo[[1]][, 4] == "green" & nonmyo[[1]][, 5] == "PASS" | nonmyo[[1]][, 4] == "khaki" & nonmyo[[1]][, 5] == "PASS", 1])
set1 <- rownames(data_CDCS$Annotation)[match(as.character(data_CDCS$Annotation[,6]), GeneNames(toupper(set1)), nomatch = 0) > 0 ]
set2 <- as.character(myo1[[1]][myo1[[1]][, 4] == "green" & myo1[[1]][, 5] == "PASS" | myo1[[1]][, 4] == "khaki" & myo1[[1]][, 5] == "PASS", 1])
set2 <- rownames(data_CDCS$Annotation)[match(as.character(data_CDCS$Annotation[, 6]), GeneNames(toupper(set2)), nomatch = 0) > 0 ]
set3 <- as.character(myo2[[1]][myo2[[1]][, 4] == "green" & myo2[[1]][, 5] == "PASS" | myo2[[1]][, 4] == "khaki" & myo2[[1]][, 5] == "PASS", 1])
set3 <- rownames(data_CDCS$Annotation)[match(as.character(data_CDCS$Annotation[, 6]), toupper(set3), nomatch = 0) > 0 ]
set4 <- as.character(hmyo[[1]][hmyo[[1]][, 4] == "green" & hmyo[[1]][, 5] == "PASS" | hmyo[[1]][, 4] == "khaki" & hmyo[[1]][, 5] == "PASS", 1])
set4 <- rownames(data_CDCS$Annotation)[match(as.character(data_CDCS$Annotation[, 6]), set4, nomatch = 0) > 0 ]
consistent_set <- unique(c(toupper(set1), 
                       toupper(set2), 
                       toupper(set3), 
                       set4))
length(consistent_set)
## [1] 128
# intersection with LVEF
leg2_consistent_lvef <- intersect(consistent_set, LVEFproteins_in_cdcs)
length(leg2_consistent_lvef)
## [1] 24

Of those, we prioritised for further analysis and experimental validation only the 30 green candidates (differentially expressed in both proteomic and transcriptomic data), of which 15 have protein counterparts that are significantly correlated with LVEF:

# green QC-passed genes found in at least one dataset 
set1 <- as.character(nonmyo[[1]][nonmyo[[1]][, 4] == "green" & nonmyo[[1]][, 5] == "PASS", 1])
set1 <- rownames(data_CDCS$Annotation)[match(as.character(data_CDCS$Annotation[,6]), GeneNames(toupper(set1)), nomatch = 0) > 0 ]
set2 <- as.character(myo1[[1]][myo1[[1]][, 4] == "green" & myo1[[1]][, 5] == "PASS", 1])
set2 <- rownames(data_CDCS$Annotation)[match(as.character(data_CDCS$Annotation[, 6]), GeneNames(toupper(set2)), nomatch = 0) > 0 ]
set3 <- as.character(myo2[[1]][myo2[[1]][, 4] == "green" & myo2[[1]][, 5] == "PASS", 1])
set3 <- rownames(data_CDCS$Annotation)[match(as.character(data_CDCS$Annotation[, 6]), toupper(set3), nomatch = 0) > 0 ]
set4 <- as.character(hmyo[[1]][hmyo[[1]][, 4] == "green" & hmyo[[1]][, 5] == "PASS", 1])
set4 <- rownames(data_CDCS$Annotation)[match(as.character(data_CDCS$Annotation[, 6]), set4, nomatch = 0) > 0 ]
prioritised_set <- unique(c(toupper(set1), 
                       toupper(set2), 
                       toupper(set3), 
                       set4))
length(prioritised_set)
## [1] 30
# intersection with LVEF
leg2_prioritised <- intersect(prioritised_set, LVEFproteins_in_cdcs)
length(leg2_prioritised)
## [1] 15

To obtain the final list of prioritised proteins for experimental validation, we found common proteins of all pairwise comparisons of three sets:

leg1_prioritised <- intersect(leg1_lvef, rownames(dina)[dina$ModuleName == "ModAnnot_HF1"])
leg1_prioritised
##  [1] "SL007696:CD93"        "SL000403:COL18A1"     "SL007547:HAVCR2"     
##  [4] "SL005152:RARRES2"     "SL018509:RSPO3"       "SL006527:EFEMP1"     
##  [7] "SL010465:MATN2"       "SL005357:REG1A"       "SL000695:LCN2"       
## [10] "SL005430:SECTM1"      "SL009400:CHRDL1"      "SL004646:LAYN"       
## [13] "SL019019:PIANP"       "SL004141:EFNA5"       "SL001800:TNFRSF1B"   
## [16] "SL000466:IGFBP2"      "SL000525:MMP7"        "SL007306:FAM3B"      
## [19] "SL011888:SMOC1"       "SL004557:CD59"        "SL005230:UNC5C"      
## [22] "SL011509:PYY"         "SL008486:LGALS9"      "SL012561:REG4"       
## [25] "SL004151:IL15RA"      "SL003190:CCL15"       "SL003970:PTH"        
## [28] "SL002506:PLAUR"       "SL007033:LTBP4"       "SL000024:F3"         
## [31] "SL007056:BMP10"       "SL007274:EFNB2"       "SL011448:TNFRSF4"    
## [34] "SL004149:IL13RA1"     "SL016148:Human-virus" "SL011068:IL17RC"     
## [37] "SL009207:DCTN2"       "SL006924:AGRP"        "SL004458:PI3"        
## [40] "SL003915:KLK8"        "SL008372:FLRT3"       "SL003060:FGFR1"      
## [43] "SL000658:GAS1"        "SL004140:EFNA4"       "SL008099:CAPG"       
## [46] "SL006119:TFF3"        "SL002785:NPPB"        "SL001774:FABP3"      
## [49] "SL009324:FSTL3"       "SL001777:CST3"        "SL007206:THBS2"      
## [52] "SL001888:SLPI"        "SL000002:VEGFA"       "SL003869:GDF15"      
## [55] "SL000052:TNNT2"       "SL001996:ANGPT2"      "SL003320:FIGF"       
## [58] "SL002654:EPHA2"       "SL001992:TNFRSF1A"    "SL000283:B2M"        
## [61] "SL005213:RELT"        "SL008631:DSC2"        "SL003329:CCL14"
leg2_prioritised
##  [1] "SL001996:ANGPT2"  "SL004149:IL13RA1" "SL007206:THBS2"   "SL007033:LTBP4"  
##  [5] "SL001777:CST3"    "SL000283:B2M"     "SL000306:NPPB"    "SL001774:FABP3"  
##  [9] "SL002505:NPPA"    "SL002785:NPPB"    "SL000695:LCN2"    "SL007207:THBS4"  
## [13] "SL009324:FSTL3"   "SL000052:TNNT2"   "SL001761:TNNI3"
immaculate_set <- rownames(meta)[meta$Validated == TRUE]
immaculate_set
##  [1] "SL002785:NPPB"     "SL006119:TFF3"     "SL009324:FSTL3"   
##  [4] "SL006527:EFEMP1"   "SL003869:GDF15"    "SL003320:FIGF"    
##  [7] "SL000052:TNNT2"    "SL001996:ANGPT2"   "SL007056:BMP10"   
## [10] "SL000306:NPPB"     "SL006610:ADAMTS13" "SL007206:THBS2"   
## [13] "SL004646:LAYN"     "SL004258:ADIPOQ"   "SL005115:SPON1"   
## [16] "SL009400:CHRDL1"   "SL011888:SMOC1"    "SL004060:ECE1"    
## [19] "SL003327:CFD"      "SL008382:CST5"     "SL003324:F10"     
## [22] "SL003770:SFRP1"    "SL005230:UNC5C"    "SL005789:STC1"    
## [25] "SL000592:TIMP2"    "SL007696:CD93"     "SL001761:TNNI3"   
## [28] "SL006924:AGRP"     "SL005209:NOTCH3"   "SL007033:LTBP4"   
## [31] "SL010381:DKKL1"    "SL002823:SELL"     "SL006230:LUM"     
## [34] "SL000462:IGFBP1"   "SL002505:NPPA"     "SL000087:IL6"

The analysis prioritised 25 proteins: 6 common among all sets, 11 common between the CDCS and the IMMACULATE sets, 3 common between the IMMACULATE and the single-cell RNA-seq set and 5 common between the CDCS and the single-cell RNA-seq set. Of them, 7 are differentially expressed in the IMMACULATE cohort at FDR = 10%:

# venn diagram
v <- venn(list(cdcs = leg1_prioritised, scRNAseq = leg2_prioritised, immaculate = immaculate_set))

# proteins of the venn diagram
attributes(v)$intersections
## $immaculate
##  [1] "SL006610:ADAMTS13" "SL004258:ADIPOQ"   "SL005115:SPON1"   
##  [4] "SL004060:ECE1"     "SL003327:CFD"      "SL008382:CST5"    
##  [7] "SL003324:F10"      "SL003770:SFRP1"    "SL005789:STC1"    
## [10] "SL000592:TIMP2"    "SL005209:NOTCH3"   "SL010381:DKKL1"   
## [13] "SL002823:SELL"     "SL006230:LUM"      "SL000462:IGFBP1"  
## [16] "SL000087:IL6"     
## 
## $`cdcs:scRNAseq`
## [1] "SL000695:LCN2"    "SL004149:IL13RA1" "SL001774:FABP3"   "SL001777:CST3"   
## [5] "SL000283:B2M"    
## 
## $`cdcs:immaculate`
##  [1] "SL007696:CD93"   "SL006527:EFEMP1" "SL009400:CHRDL1" "SL004646:LAYN"  
##  [5] "SL011888:SMOC1"  "SL005230:UNC5C"  "SL007056:BMP10"  "SL006924:AGRP"  
##  [9] "SL006119:TFF3"   "SL003869:GDF15"  "SL003320:FIGF"  
## 
## $`scRNAseq:immaculate`
## [1] "SL000306:NPPB"  "SL002505:NPPA"  "SL001761:TNNI3"
## 
## $`cdcs:scRNAseq:immaculate`
## [1] "SL007033:LTBP4"  "SL002785:NPPB"   "SL009324:FSTL3"  "SL007206:THBS2" 
## [5] "SL000052:TNNT2"  "SL001996:ANGPT2"
## 
## $cdcs
##  [1] "SL000403:COL18A1"     "SL007547:HAVCR2"      "SL005152:RARRES2"    
##  [4] "SL018509:RSPO3"       "SL010465:MATN2"       "SL005357:REG1A"      
##  [7] "SL005430:SECTM1"      "SL019019:PIANP"       "SL004141:EFNA5"      
## [10] "SL001800:TNFRSF1B"    "SL000466:IGFBP2"      "SL000525:MMP7"       
## [13] "SL007306:FAM3B"       "SL004557:CD59"        "SL011509:PYY"        
## [16] "SL008486:LGALS9"      "SL012561:REG4"        "SL004151:IL15RA"     
## [19] "SL003190:CCL15"       "SL003970:PTH"         "SL002506:PLAUR"      
## [22] "SL000024:F3"          "SL007274:EFNB2"       "SL011448:TNFRSF4"    
## [25] "SL016148:Human-virus" "SL011068:IL17RC"      "SL009207:DCTN2"      
## [28] "SL004458:PI3"         "SL003915:KLK8"        "SL008372:FLRT3"      
## [31] "SL003060:FGFR1"       "SL000658:GAS1"        "SL004140:EFNA4"      
## [34] "SL008099:CAPG"        "SL001888:SLPI"        "SL000002:VEGFA"      
## [37] "SL002654:EPHA2"       "SL001992:TNFRSF1A"    "SL005213:RELT"       
## [40] "SL008631:DSC2"        "SL003329:CCL14"      
## 
## $scRNAseq
## [1] "SL007207:THBS4"
# the 25 proteins of interest
attributes(v)$intersections[2:5]
## $`cdcs:scRNAseq`
## [1] "SL000695:LCN2"    "SL004149:IL13RA1" "SL001774:FABP3"   "SL001777:CST3"   
## [5] "SL000283:B2M"    
## 
## $`cdcs:immaculate`
##  [1] "SL007696:CD93"   "SL006527:EFEMP1" "SL009400:CHRDL1" "SL004646:LAYN"  
##  [5] "SL011888:SMOC1"  "SL005230:UNC5C"  "SL007056:BMP10"  "SL006924:AGRP"  
##  [9] "SL006119:TFF3"   "SL003869:GDF15"  "SL003320:FIGF"  
## 
## $`scRNAseq:immaculate`
## [1] "SL000306:NPPB"  "SL002505:NPPA"  "SL001761:TNNI3"
## 
## $`cdcs:scRNAseq:immaculate`
## [1] "SL007033:LTBP4"  "SL002785:NPPB"   "SL009324:FSTL3"  "SL007206:THBS2" 
## [5] "SL000052:TNNT2"  "SL001996:ANGPT2"
# DE in IMMACULATE
intersect(DEproteins_in_immaculate10, unlist(attributes(v)$intersections[2:5]))
## [1] "SL007033:LTBP4"  "SL007696:CD93"   "SL003320:FIGF"   "SL001996:ANGPT2"
## [5] "SL009324:FSTL3"  "SL000306:NPPB"   "SL002505:NPPA"

We calculated the enrichment of the 6 common proteins of all sets: we generated a 2x2 frequency matrix with the ‘proteins validated in all sets’ (n1), the ‘proteins validated in the two proteomics but not in the RNA-seq’ (n2), the ‘proteins not validated in both proteomics but validated in RNA-seq’ (n3) and the ‘proteins not validated anywhere’ (n4). The results show a significant enrichment with Fisher P-value = 2.25e-4.

# the 6 proteins of interest
n1 <- length(unlist(attributes(v)$intersections[5]))

# only found in proteomics
n2 <- length(unlist(attributes(v)$intersections[c(1, 3, 6)]))

# only found in RNA-seq and one of the two proteomics
n3 <- length(unlist(attributes(v)$intersections[c(2, 4, 7)]))

# found nowhere 
n4 <- length(setdiff(rownames(meta)[meta$Validated == FALSE], unlist(attributes(v)$intersections)))

# enrichment
mat <- matrix(c(n1, n2, n3, n4), ncol = 2)
colnames(mat) <- c("Proteomics","Not-proteomics")
rownames(mat) <- c("RNA","Not-RNA")
mat
##         Proteomics Not-proteomics
## RNA              6              9
## Not-RNA         68           1024
fisher.test(mat)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  mat
## p-value = 0.0002256
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   2.839098 32.480616
## sample estimates:
## odds ratio 
##   9.989364

At the last step of our prioritisation procedure, we examine which of the 25 proteins (found in all sets or at least one of the proteomics and the single-cell RNA-seq) are directly connected to at least one of the DiNA proteins:

# the 25 validated proteins
validated25 <- unlist(attributes(v)$intersections[2:5])

# the DiNA proteins with significant LVEF association
leg1_DiNA
##  [1] "SL004141:EFNA5"    "SL001800:TNFRSF1B" "SL004557:CD59"    
##  [4] "SL004151:IL15RA"   "SL004140:EFNA4"    "SL006119:TFF3"    
##  [7] "SL001777:CST3"     "SL002654:EPHA2"    "SL001992:TNFRSF1A"
## [10] "SL000283:B2M"      "SL005213:RELT"     "SL008631:DSC2"
# the WGCNA edges of HF1
HFedges <- read.table("ProteomicsData/CytoscapeInput_HF_edges_blue.txt", sep = "\t", header = T)

# the final prioritised protein set
finalSet <- integrated_prioritisation(edges = HFedges,
                                      dina = leg1_DiNA,
                                      validated = validated25)
finalSet
##            cdcs:scRNAseq4            cdcs:scRNAseq5          cdcs:immaculate9 
##           "SL001777:CST3"            "SL000283:B2M"           "SL006119:TFF3" 
## cdcs:scRNAseq:immaculate3            cdcs:scRNAseq1          cdcs:immaculate4 
##          "SL009324:FSTL3"           "SL000695:LCN2"           "SL004646:LAYN" 
##          cdcs:immaculate8          cdcs:immaculate1         cdcs:immaculate10 
##           "SL006924:AGRP"           "SL007696:CD93"          "SL003869:GDF15" 
##         cdcs:immaculate11          cdcs:immaculate5            cdcs:scRNAseq2 
##           "SL003320:FIGF"          "SL011888:SMOC1"        "SL004149:IL13RA1" 
##          cdcs:immaculate7          cdcs:immaculate3 cdcs:scRNAseq:immaculate1 
##          "SL007056:BMP10"         "SL009400:CHRDL1"          "SL007033:LTBP4" 
## cdcs:scRNAseq:immaculate6 cdcs:scRNAseq:immaculate4          cdcs:immaculate6 
##         "SL001996:ANGPT2"          "SL007206:THBS2"          "SL005230:UNC5C" 
##            cdcs:scRNAseq3          cdcs:immaculate2      scRNAseq:immaculate1 
##          "SL001774:FABP3"         "SL006527:EFEMP1"           "SL000306:NPPB" 
## cdcs:scRNAseq:immaculate5 
##          "SL000052:TNNT2"

We found 22 proteins which form our final prioritised set. TFF3, CST3 and B2M are among those at the top of the list being cross-cohort validated (differentially expressed in CDCS and validated in IMMACULATE), differentially co-expressed in CDCS and significantly associated to LVEF. This final 22-protein set includes 5 of the 6 proteins were found in all datasets:

# cross-cohort validated, DiNA and LVEF associated 
intersect(leg1_DiNA, finalSet)
## [1] "SL006119:TFF3" "SL001777:CST3" "SL000283:B2M"
# proteins found in all datasets
finalSet[grep("cdcs:scRNAseq:immaculate", names(finalSet))]
## cdcs:scRNAseq:immaculate3 cdcs:scRNAseq:immaculate1 cdcs:scRNAseq:immaculate6 
##          "SL009324:FSTL3"          "SL007033:LTBP4"         "SL001996:ANGPT2" 
## cdcs:scRNAseq:immaculate4 cdcs:scRNAseq:immaculate5 
##          "SL007206:THBS2"          "SL000052:TNNT2"